Perl multithreading slower when memory usage getting high - multithreading

Hi all~ I has written a very simple code in Perl using multithreading. The codes are as follows.
#!/bin/perl
use strict;
use threads;
use Benchmark qw(:hireswallclock);
my $starttime;
my $finishtime;
my $timespent;
my $num_of_threads = 1;
my $total_size = 10000000;
my $chunk_size = int($total_size / $num_of_threads);
if($total_size % $num_of_threads){
$chunk_size++;
}
my #threads = ();
$starttime = Benchmark->new;
for(my $i = 0; $i < $num_of_threads; $i++) {
my $thread = threads->new(\&search);
push (#threads, $thread);
}
foreach my $thread (#threads) {
$thread->join();
}
my $finishtime = Benchmark->new;
$timespent = timediff($finishtime, $starttime);
print "$num_of_threads threads used in ".timestr($timespent)."\nDone!\n";
sub search{
my $i = 0;
while($i < $chunk_size){
$i++;
}
return 1;
}
This piece of codes works fine as when increasing the number of threads, it will run faster.
However, when adding additional lines in the middle, which will create an array in big size, the code will run more slowly when adding more threads. The codes with the additional lines are shown below.
#!/bin/perl
use strict;
use threads;
use Benchmark qw(:hireswallclock);
my $starttime;
my $finishtime;
my $timespent;
my $num_of_threads = 1;
my $total_size = 10000000;
my $chunk_size = int($total_size / $num_of_threads);
if($total_size % $num_of_threads){
$chunk_size++;
}
##########Additional codes##########
print "Preparing data...\n";
$starttime = Benchmark->new;
my #array = ();
for(my $i = 0; $i < $total_size; $i++){
my $rn = rand();
push(#array, $rn);
}
$finishtime = Benchmark->new;
$timespent = timediff($finishtime, $starttime);
print "Used ".timestr($timespent)."\n";
######################################
my #threads = ();
$starttime = Benchmark->new;
for(my $i = 0; $i < $num_of_threads; $i++) {
my $thread = threads->new(\&search);
push (#threads, $thread);
}
foreach my $thread (#threads) {
$thread->join();
}
my $finishtime = Benchmark->new;
$timespent = timediff($finishtime, $starttime);
print "$num_of_threads threads used in ".timestr($timespent)."\nDone!\n";
sub search{
my $i = 0;
while($i < $chunk_size){
$i++;
}
return 1;
}
I am so confused regarding such a behaviour in Perl's multithreading. Does anyone has any idea what may go wrong here?
Thanks!

You have to remember that when using ithreads - interpreter threads - the entire Perl interpreter, including code and memory, is cloned into the new thread. So the more data to be cloned, the longer it takes. There are ways to control what is cloned; take a look at the threads perldoc.
You should do as little as possible and not even load many modules until after you have spawned your threads.
If you do have a lot of data that will be used by all the threads, make it shared with threads::shared. Then to share a data structure use shared_clone(). You cannot simply share() anything, other than a simple variable. That shared variable can only contain plain scalars or other shared references.
If you are going to consume or pump that data, make it a queue instead with the Thread::Queue module. It automatically shares the values and takes care of the locking. After spawning your pool of worker threads, control them with Thread::Semaphore. That way they won't terminate before you've given them anything to do. You can also prevent race conditions.
https://metacpan.org/pod/threads::shared
HTH

Thank you all for pointing me to the relative directions! I have learned and tried different things, inlucding how to use shared and queue, which should be able to solve the problem. So I revised the script as follows:
#!/bin/perl
use strict;
use threads;
use threads::shared;
use Thread::Queue;
use Benchmark qw(:hireswallclock);
my $starttime;
my $finishtime;
my $timespent;
my $num_of_threads = shift #ARGV;
my $total_size = 100000;
######Initiation of a 2D queue######
print "Preparing queue...\n";
$starttime = Benchmark->new;
my $queue = Thread::Queue->new();
for(my $i = 0; $i < $total_size; $i++){
my $rn1 = rand();
my $rn2 = rand();
my #interval :shared = sort($rn1, $rn2);
$queue->enqueue(\#interval);
}
$finishtime = Benchmark->new;
$timespent = timediff($finishtime, $starttime);
print "Used ".timestr($timespent)."\n";
#####################################
$starttime = Benchmark->new;
my $queue_copy = $queue; #Copy the 2D queue so that the original queue can be kept\
for(my $i = 0; $i < $num_of_threads; $i++) {
my $thread = threads->create(\&search, $queue_copy);
}
foreach my $thread (threads->list()) {
$thread->join();
}
$finishtime = Benchmark->new;
$timespent = timediff($finishtime, $starttime);
print "$num_of_threads threads used in ".timestr($timespent)."\nDone!\n";
#####################################
sub search{
my $temp_queue = $_[0];
while(my $temp_interval = $temp_queue->dequeue_nb()){
#Do something
}
return 1;
}
What I was trying to do was , first, to make a queue of arrays, with each containing two numbers. A copy of the queue was made as I want to preserve the original queue when going through it. Then the copied queue was gone through using multithreads. However, I still found out that it ran slower when more threads were added, which I have no clue why.

Related

perl multithreading perl exited with active threads

I write perl script to do a certain but I needed to mutlithread it, but I get this:
Perl exited with active threads:
2 running and unjoined
0 finished and unjoined
0 running and detached
Here is the code:
use Net::Ping;
use threads;
use Benchmark qw(:hireswallclock);
use threads::shared;
my $starttime = Benchmark->new;
my $finishtime;
my $timespent;
my $num_of_threads = 2;
my #threads = initThreads();
my $inFile = $ARGV[0] ;
open(IN , "<$inFile") or die "can not find $inFile" ;
my #output: shared = <IN>;
chomp (#output) ;
my $chunk_size = #output / 2;
print($chunk_size);
#############################
######## PROCEDURES ########
#############################
# Subroutine that intializes an array that will contain all our threads:
sub initThreads{
my #initThreads; # Our array
for(my $i=1; $i<=$num_of_threads; $i++){
push(#initThreads, $i);
}
return #initThreads;
}
sub doScript{
my $id = threads->tid();
print "//////////////////////////////////////////////////////////////////////////////////////////////Starting thread $id\n";
my ($start, $end, $output) = #_;
for my $i ($start .. $end) {
## some stuff done
sleep 1 if 0.2 > rand;
}
print "/////////////////////////////////////////////////////////////////////////////////////////////////////////////////Thread $id done!\n";
threads->exit();
}
########################
######## MAIN ########----------------------------------------------------------------
########################
for my $chunk(1 .. 2){
my $start = ($chunk - 1) * $chunk_size;
push #threads, threads->create(
\&doScript,
$start,
($start + $chunk_size - 1),
\#output,
);
print("finish");
}
# This tells the main program to keep running until all threads have finished.
foreach(#threads){
threads->join();
}
$finishtime = Benchmark->new;
$timespent = timediff($finishtime,$starttime);
print "\nDone!\nSpent ". timestr($timespent);
#print "\nProgram Done!\nPress Enter to exit\n";
$a = <>;
close (IN);
I even replaced the join block with this:
$_->join() for threads->list();
the error was solved but the script seemed not to do anything, threads started and terminated without doing anything.
Can anyone help me with this !
The reason you get that error is exactly as said on the tin - your code exited before the threads were closed. It can commonly happen when an exit or die is triggered early.
In particular I think your problem may lie within your reusing of #threads and initThreads().
The latter returns a list of numbers, not any threads. And then you push a couple more threads to the end of the list later, when you're doing a create.
That looks a lot like some sort of logic error to me.
But the major problem I think will be this:
foreach(#threads){
threads->join();
}
You're not actually joining a specific thread. What you probably want is:
foreach my $thr ( #threads ) {
$thr -> join();
}
At least, you would, if you hadn't manually populated #threads with [1,2] thanks to my #threads = initThreads();

Missing characters while reading input with threads

Let's say we have a script which open a file, then read it line by line and print the line to the terminal. We have a sigle thread and a multithread version.
The problem is than the resulting output of both scripts is almost the same, but not exactly. In the multithread versions there are about ten lines which missed the first 2 chars. I mean, if the real line is something line "Stackoverflow rocks", I obtain "ackoverflow rocks".
I think that this is related to some race condition since if I adjust the parameters to create a lot of little workers, I get more faults than If I use less and bigger workers.
The single thread is like this:
$file = "some/file.txt";
open (INPUT, $file) or die "Error: $!\n";
while ($line = <STDIN>) {
print $line;
}
The multithread version make's use of the thread queue and this implementation is based on the #ikegami approach:
use threads qw( async );
use Thread::Queue 3.01 qw( );
use constant NUM_WORKERS => 4;
use constant WORK_UNIT_SIZE => 100000;
sub worker {
my ($job) = #_;
for (#$job) {
print $_;
}
}
my $q = Thread::Queue->new();
async { while (defined( my $job = $q->dequeue() )) { worker($job); } }
for 1..NUM_WORKERS;
my $done = 0;
while (!$done) {
my #lines;
while (#lines < WORK_UNIT_SIZE) {
my $line = <>;
if (!defined($line)) {
$done = 1;
last;
}
push #lines, $line;
}
$q->enqueue(\#lines) if #lines;
}
$q->end();
$_->join for threads->list;
I tried your program and got similar (wrong) results. Instead of Thread::Semaphore I used lock from threads::shared around the print as it's simpler to use than T::S, i.e.:
use threads;
use threads::shared;
...
my $mtx : shared;
sub worker
{
my ($job) = #_;
for (#$job) {
lock($mtx); # (b)locks
print $_;
# autom. unlocked here
}
}
...
The global variable $mtx serves as a mutex. Its value doesn't matter, even undef (like here) is ok.
The call to lock blocks and returns only if no other threads currently holds the lock on that variable.
It automatically unlocks (and thus makes lock return) when it goes out of scope. In this sample that happens
after every single iteration of the for loop; there's no need for an extra {…} block.
Now we have syncronized the print calls…
But this didn't work either, because print does buffered I/O (well, only O). So I forced unbuffered output:
use threads;
use threads::shared;
...
my $mtx : shared;
$| = 1; # force unbuffered output
sub worker
{
# as above
}
...
and then it worked. To my surprise I could then remove the lock and it still worked. Perhaps by accident. Note that your script will run significantly slower without buffering.
My conclusion is: you're suffering from buffering.

how to convert a nested for loop into multithreading program in perl

I need help with converting a nested for loop into multthreading program in Perl, e.g.
for ( my $i=0; $i<100; $i++) {
for ( my $j=0; $j<100; $j++ ) {
for ( my $k=0; $k<100; $k++ ) {
#do something ....
}
}
}
Is there a way where i can split the first loop as below and run them in parallel
#Job1:
for ( my $i=0; $i < 40; $i++) {
for( my $j=0; $j < 100; $j++) {
for( my $k=0; $k < 100; $k++) {
#do something ....
}
}
}
#Job2:
for ( my $i=40; $i < 80; $i++) {
for( my $j=0; $j<100; $j++) {
for( my $k=0; $k<100; $k++) {
#do something ....
}
}
}
#Job3
for ( my $i=80; $i < 100; $i++) {
for( my $j=0; $j < 100; $j++) {
for( my $k=0; $k < 100; $k++) {
#do something ....
}
}
}
How can I run each program in parallel and then exit the main program only when all the sub program Job1,Job2 and job3 are complete.
I'll offer a reference to a similar answer I've used before - they key question is - are your jobs completely decoupled? E.g. no data needs to move between them?
If so, use Parallel::ForkManager it goes a bit like this:
use Parallel::ForkManager;
my $fork_manager = Parallel::ForkManager -> new ( 10 ); #10 in parallel
for ( my $i=0;$i<100;$i++) {
#in parallel:
$fork_manager -> start and next;
for ( my $j=0; $j < 100; $j++) {
for ( my $k=0; $k < 100; $k++) {
#do something ....
}
}
$fork_manager -> finish;
}
$fork_manager -> wait_all_children();
This will, for each iteration of $i fork the code and run in parallel - and ForkManager will cap the concurrency at 10.
This number should be approximately comparable to the limiting factor in your parallelism - if it's CPU, then number of CPUs, but bear in mind that you're often more constrained by disk IO.
Key caveats when doing parallelism:
You can't guarantee execution sequence without messing around. It's entirely possible that loop $i==1 finishes after loop $i==2. Or before. Or whatever.
If you're passing information between your loops, parallel loses efficiency - because the sender and receiver each need to synchronise. It's even worse if you need to synchronise the whole lot, so try to avoid doing that more than necessary. (e.g. wherever possible, leave it until the end and collate the results).
That goes double for forked code - they're separate processes, so you actually have to try to transfer things back and forth.
You can get some really very fruity bugs from parallel code, because of that first point. Individual lines of code may occur in any order, so very strange things can happen. Each process will sequence, but multiple may well interleave. Something innocuous like open ( my $file, ">>", $output_filename ); can trip you up.
forking is quite limited in it's ability to share data between forks. If you need to do much of this, consider threading instead.
Threading is an alternative model of concurrency, that can be valuable in certain circumstance. I'm generally leaning towards forking being generally 'better', but in places where I'm wanting to do a fair bit of inter-process communication, I'd be tending to look more towards threads.
Perl daemonize with child daemons

Keeping count with threads in perl

Im trying to count whenever a thread is done in perl, and print the count. but this is not working. i keep getting either "0" or "1", im trying to add to the count then print the count right after the get request is made.
use strict;
use threads;
use LWP::UserAgent;
our $MAX //= $ARGV[1];
my $list = $ARGV[0];
open my $handle, '<', $list;
chomp(my #array = <$handle>);
close $handle;
my $lines = `cat $list | wc -l`;
my $count = 0;
my #threads;
foreach $_ (#array) {
push #threads, async{
my #chars = ("a".."z");
my $random = join '', map { #chars[rand #chars] } 1 .. 6;
my $ua = LWP::UserAgent->new;
my $url = $_ . '?session=' . $random;
my $response = $ua->get($url);
count++;
print $count;
};
sleep 1 while threads->list( threads::running ) > $MAX;
}
$_->join for #threads;
Just to summarise points in comments by #choroba and myself, and not leave the question without an answer.
You would need to include:
use threads::shared;
in your code, along with all the other use elements.
And to indicate that variable $count is shared:
my $count :shared = 0;
EDIT As per Ikegami's comment, you would have to lock the variable if you want to modify it, to avoid problems of concurrency.
{
lock($count);
$count++;
print $count;
}
And that should be enough for the variable $count to be shared.

Perl: write value in thread

I am trying to get text of two large files. To speed it up i tried threads.
Before i used threads the script worked, now it does not.
The problem is: I save everything I read in the file into a hash.
When i print out the size (or keys/values) after the read-in in the sub (which the thread executed) it shows a correct number > 0, when i print out the size of the hash anywhere else (after the threads have run) it shows me 0.
print ": ".keys(%c);
is used 2 times, and has different output each time.
(In the final programm 2 Threads are running and a method to compare the stuff is called after the threads finished)
Example code:
my %c;
my #threads = initThreads();
#threads[0] = threads->create(\&ce);
foreach(#threads){
$_->join();
}
print ": ".keys(%c);
sub initThreads{
my #initThreads;
for(my $i = 0; $i<2;$i++){
push(#initThreads, $i);
}
return #initThreads;
}
sub ce(){
my $id = threads->tid();
open my $file, "<", #arg1[1] or die $!;
my #cXY;
my #cDa;
while(my $line = <$file>){
# some regex and push to arrays, works
#c{#cXY} = #cDa;
}
print "Thread $id is done\n";
close $file;
print ": ".keys(%c);
threads->exit();
}
Do i have to run the things after the first 2 threads finished in another thread which waits until the first two are finished?
Or what am i doing wrong with threads?
Thanks.
%c isn't shared across your threads.
use threads;
use threads::shared
my %c :shared;
See threads::shared.
In Perl, threads don't share memory. Each thread operates on a different copy of %c, so the changes aren't reflected to the parent thread. While sharing a variable across threads is possible, this is not generally advisable.
Make use of the possibility to return data from a thread. E.g
my %c = map %{ $_->join }, #threads; # flatten all returned hashes
sub ce {
my %hash;
...;
return \%hash;
}
Some other suggestions:
use strict; use warnings; if you aren't already.
use better variable names.
you only seem to be spawning one thread (in $threads[0]).
my #array; for (my $i = 0; $i < 2; $i++){ push(#array, $i) } is equivalent to my #array = 0 .. 1.
#arg1 is not declared in the current scope.
manually exiting a thread is not neccessary in your case.

Resources