I have a large codebase, and somewhere in the middle of it there's a foreach loop filling up some dictionaries. Each iteration is independent and fills those dictionaries with disjoint keys.
I'm trying to make that foreach loop multithreaded in order to decrease the run time.
In the following example, $a1 and $b1 are references to the dictionaries.
I tried threads::shared this way:
my $a1 = {};
my $b1 = {};
my $c1 = {};
my $d1 = {};
# a lot of code using $a1 and $b1
share($a1);
share($b1);
share($c1);
share($d1);
my @threads;
foreach my $Inst ( sort keys %{ $a1->{ports} } ) {
    push( @threads, threads->create('some_func', $Inst, $a1, $b1, $c1, $d1, $e ...) );
}
for my $thr (@threads) {
    $thr->join();
}
# all the other code
But I get this error:
Invalid value for shared scalar at ...
Any ideas how to get the data structures filled, but in a way that doesn't interfere with the code before and after the foreach loop?
It is not possible to make a hash shared after it has been created/declared without losing the data in the hash. Instead, you could try using shared_clone() like this:
use feature qw(say);
use strict;
use warnings;
use threads ;
use threads::shared ;
use Data::Dumper qw(Dumper);
my %h1 = (a => 1, b => 2);
my %h2 = (c => 3, d => 4);
my $a1 = \%h1;
my $b1 = \%h2;
my $a1c = shared_clone($a1);
my $b1c = shared_clone($b1);
my $lockvar :shared;
my $nthreads = 3;
for ( 1 .. $nthreads ) {
    threads->create('job_to_parallelize', $a1c, $b1c, \$lockvar);
}
$_->join() for threads->list();

sub job_to_parallelize {
    my ($a1, $b1, $lockvar) = @_;
    {
        lock $lockvar;
        $a1->{a}++;
        $b1->{d}++;
    }
}
print Dumper({a1c => $a1c});
print Dumper({b1c => $b1c});
Output:
$VAR1 = {
          'a1c' => {
                     'a' => 4,
                     'b' => 2
                   }
        };
$VAR1 = {
          'b1c' => {
                     'd' => 7,
                     'c' => 3
                   }
        };
I'm using AnyEvent to trigger an event every X minutes. Some processes are long, and recording the results in the database sometimes takes a while (more than 3 seconds at times). I want to be sure that only one function accesses my database at a time, so I use lock from the threads::shared module. Here is my code:
main.pl:
my $var :shared;
$var = 0;

my $w1 = AnyEvent->timer(after => 0, interval => 120, cb => sub {
    my @targeted_users = TargetUsers::get_targeted_account_from_followers($dbh, $account, 'myrhline');
    TargetUsers::save_users_to_follow($dbh, 'RH', $var, @targeted_users);
});

my $w2 = AnyEvent->timer(after => 0, interval => 120, cb => sub {
    my @targeted_users2 = TargetUsers::get_targeted_account_from_followers($dbh, $account2, 'actuel_rh');
    TargetUsers::save_users_to_follow($dbh, 'RH', $var, @targeted_users2);
});

AnyEvent::Loop::run;
TargetUsers:
sub save_users_to_follow {
    my ($dbh, $topic, $var, @targeted_users) = @_;
    lock($var);    # I call lock($var) to lock the access to the database.
    my @array1 = ();
    $dbh->{AutoCommit} = 0;
    my $sth2 = $dbh->prepare(
        "UPDATE People_TO_FOLLOW
         SET Pertinence = Pertinence + 1
         WHERE Id_user = ?");
    my $tuples_update = $sth2->execute_array(
        { ArrayTupleStatus => \my @tuple_status_update },
        \@targeted_users,
    );
    if ($tuples_update) {
        print "Successfully updated $tuples_update records\n";
    }
    else {
        for my $tuple_1 (0 .. $#targeted_users) {
            my $status_1 = $tuple_status_update[$tuple_1];
            $status_1 = [0, "Skipped"] unless defined $status_1;
            next unless ref $status_1;
            printf "Failed to update (%s): %s\n",
                $targeted_users[$tuple_1], $status_1->[1];
        }
    }
    $sth2->finish();
    my $sth = $dbh->prepare(
        "INSERT OR IGNORE INTO People_TO_FOLLOW(Id_user,Topic,Already_followed,Pertinence)
         VALUES (?,'$topic',0,1)");
    my $tuples_insert = $sth->execute_array(
        { ArrayTupleStatus => \my @tuple_status_insert },
        \@targeted_users,
    );
    if ($tuples_insert) {
        print "Successfully inserted $tuples_insert records\n";
    }
    else {
        for my $tuple_2 (0 .. $#targeted_users) {
            my $status_2 = $tuple_status_insert[$tuple_2];
            $status_2 = [0, "Skipped"] unless defined $status_2;
            next unless ref $status_2;
            printf "Failed to insert (%s): %s\n",
                $targeted_users[$tuple_2], $status_2->[1];
        }
    }
    $sth->finish();
    $dbh->commit;
    $dbh->{AutoCommit} = 1;
}
But I get this error: "lock can only be used on shared values".
Do you have any ideas? Thanks.
You have two vars named $var, the one in main.pl, and the one in save_users_to_follow. The former is the shared variable, but you are trying to lock the latter.
Move the variable from main to TargetUsers, or pass a reference to the lock variable to save_users_to_follow.
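A minimal sketch of the second option, reusing the question's names ($var_ref is an illustrative name, not from the original); the lock is taken on the variable the reference points to:
use strict;
use warnings;
use threads;
use threads::shared;

my $var :shared = 0;

# main.pl side: pass a reference to the shared variable
# TargetUsers::save_users_to_follow($dbh, 'RH', \$var, @targeted_users);

# TargetUsers side: lock through the reference
sub save_users_to_follow {
    my ($dbh, $topic, $var_ref, @targeted_users) = @_;
    lock($$var_ref);    # locks the shared $var declared in main.pl
    # ... database work happens while the lock is held ...
}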
I originally experimented with trying to send a hash object through Thread::Queue, but according to this link, my versions of Thread::Queue and threads::shared are too old. Unfortunately, since the system I'm testing on isn't mine, I can't upgrade.
I then tried to use a common array to store my hashes. Here is the code so far:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

use constant NUM_WORKERS => 10;

my @out_array;

test1();

sub test1 {
    my $in_queue = Thread::Queue->new();
    foreach (1 .. NUM_WORKERS) {
        async {
            while (my $job = $in_queue->dequeue()) {
                test2($job);
            }
        };
    }
    my @sentiments = ("Axe Murderer", "Mauler", "Babyface", "Dragon");
    $in_queue->enqueue(@sentiments);
    $in_queue->enqueue(undef) for 1 .. NUM_WORKERS;
    $_->join() for threads->list();
    foreach my $element (@out_array) {
        print "element: $element\n";
    }
}

sub test2 {
    my $string = $_[0];
    my %hash = (Skeleton => $string);
    push @out_array, \%hash;
}
However, at the end of the procedure, @out_array is always empty. If I remove the threading parts of the script, then @out_array is correctly populated. I suspect I'm implementing threading incorrectly here.
How would I correctly populate @out_array in this instance?
You need to make it shared:
use threads::shared;
my @out_array :shared;
I don't think you need to lock it if all you do is push onto it, but if you did, you'd use
lock @out_array;
You need to share any array or hash referenced by a value you push onto it, using the tools in threads::shared.
push @out_array, share(%hash);
Though as I mentioned earlier, I'd use a Thread::Queue.
sub test2 {
    my ($string) = @_;
    my %hash = ( Skeleton => $string );
    return \%hash;
}
...
my $response_q = Thread::Queue->new();
my $running :shared = NUM_WORKERS;
...
async {
    while (my $job = $request_q->dequeue()) {
        $response_q->enqueue(test2($job));
    }
    { lock $running; $response_q->enqueue(undef) if !--$running; }
};
...
$request_q->enqueue(@sentiments);
$request_q->enqueue(undef) for 1..NUM_WORKERS;
while (my $response = $response_q->dequeue()) {
    print "Skeleton: $response->{Skeleton}\n";
}
$_->join() for threads->list();
Note the lack of anything thread-specific in test2. This is good. You should always strive for separation of concerns.
You need to return your data from the thread:
...
async {
    my $data;
    while (my $job = $in_queue->dequeue()) {
        $data = test2($job);
    }
    return $data;
};
...
for ( threads->list() ) {
    my $data = $_->join();
    # now you have this thread's return value in $data
}

sub test2 {
    my $string = $_[0];
    my %hash = (Skeleton => $string);
    return \%hash;
}
I found my answer in the example here.
I had to change 3 things:
share the @out_array outside both subs
share the %hash in test2
add return; to the end of test2
Code outside both subs:
my @out_array :shared = ();
test2 sub:
sub test2 {
    my $string = $_[0];
    my %hash :shared;
    $hash{Skeleton} = $string;
    push @out_array, \%hash;
    return;
}
I'm working on a project, implemented in Perl, and thought it would be a good idea to use threads to distribute the work, because the tasks can be done independently of each other, only reading from shared data in memory. However, the performance is nowhere near what I expect it to be. After some investigation I can only conclude that threads in Perl basically suck, but I keep wondering why the performance goes down the drain as soon as I introduce a single shared variable.
For example, this little program has nothing shared and consumes 75% of the CPU (as expected):
use threads;

sub fib {
    my ($n) = @_;
    if ($n < 2) {
        return $n;
    } else {
        return fib($n - 1) + fib($n - 2);
    }
}

my $thr1 = threads->create('fib', 35);
my $thr2 = threads->create('fib', 35);
my $thr3 = threads->create('fib', 35);

$thr1->join;
$thr2->join;
$thr3->join;
And as soon as I introduce a shared variable $a, the CPU usage is somewhere between 40% and 50%:
use threads;
use threads::shared;

my $a :shared;
$a = 1000;

sub fib {
    my ($n) = @_;
    if ($n < 2) {
        return $n;
    } else {
        return $a + fib($n - 1) + fib($n - 2);    # <-- $a was added here
    }
}

my $thr1 = threads->create('fib', 35);
my $thr2 = threads->create('fib', 35);
my $thr3 = threads->create('fib', 35);

$thr1->join;
$thr2->join;
$thr3->join;
So $a is read-only and no locking takes place, and yet the performance decreases. I'm curious why this happens.
At the moment I'm using Perl 5.10.1 under Cygwin on Windows XP. Unfortunately I couldn't test this on a non-Windows machine with a (hopefully) more recent Perl.
Your code is a tight loop around a synchronized structure. Optimize it by having each thread copy the shared variable (just once per thread) into an unshared variable.
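For example, a minimal sketch of that optimization against the question's fib code (fib_worker and fib_local are illustrative names, not from the original):
use strict;
use warnings;
use threads;
use threads::shared;

my $shared_a :shared = 1000;

# Each thread reads the shared variable exactly once, then recurses on a
# plain lexical copy, so no per-access synchronization takes place.
sub fib_worker {
    my ($n) = @_;
    my $a_copy = $shared_a;    # the single read of the shared variable
    return fib_local($n, $a_copy);
}

sub fib_local {
    my ($n, $a) = @_;
    return $n if $n < 2;
    return $a + fib_local($n - 1, $a) + fib_local($n - 2, $a);
}

my @thrs = map { threads->create('fib_worker', 35) } 1 .. 3;
$_->join for @thrs;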
Constructing a shared object containing lots of data is possible in Perl without worrying about extra copies. There is no performance impact when spawning workers, because the shared data resides inside a separate thread or process, depending on whether threads are used.
use MCE::Hobo;     # use threads okay, or the parallel module of your choice
use MCE::Shared;

# The module option constructs the object under the shared-manager.
# There's no trace of the data inside the main process. The construction
# returns a shared reference containing an id and class name.
my $data = MCE::Shared->share( { module => 'My::Data' } );

my $b;

sub fib {
    my ($n) = @_;
    if ($n < 2) {
        return $n;
    } else {
        return $b + fib($n - 1) + fib($n - 2);
    }
}

my @thrs;
push @thrs, MCE::Hobo->create( sub { $b = $data->get_keys(1000); fib(35) } );
push @thrs, MCE::Hobo->create( sub { $b = $data->get_keys(2000); fib(35) } );
push @thrs, MCE::Hobo->create( sub { $b = $data->get_keys(3000); fib(35) } );

$_->join() for @thrs;
exit;

# Populate $self with data. When shared, the data resides under the
# shared-manager thread (via threads->create) or process (via fork).
package My::Data;

sub new {
    my $class = shift;
    my %self = map { $_ => $_ } 1000 .. 5000;
    bless \%self, $class;
}

# Add any getter methods to suit the application. Supporting multiple
# keys helps reduce the number of trips via IPC. Serialization is
# handled automatically if a getter method returns a hash ref.
# MCE::Shared will use Sereal::{Encoder,Decoder} if available - faster.
sub get_keys {
    my $self = shift;
    if (wantarray) {
        return map { $_ => $self->{$_} } @_;
    } else {
        return $self->{ $_[0] };
    }
}

1;
Is there a multiprocessing module for Perl? Something that has similar functionality to what's offered by Python's multiprocessing module.
I understand I could build similar functionality using Perl, but I'm looking for something already implemented.
forks provides the same awesome interface as threads, but uses processes instead of threads.
use forks;    # Or: use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

my @workers;
for (1 .. NUM_WORKERS) {
    push @workers, async {
        while (defined(my $job = $q->dequeue())) {
            ...
        }
    };
}

$q->enqueue(...);
$q->enqueue(undef) for @workers;

$_->join for @workers;
Comparing forks with Forks::Super. Keep in mind, these are supposed to be the cases where Forks::Super excels!
use Forks::Super;
sub do_something { my @args = @_; ... }
$process = fork { sub => \&do_something, args => [@args] };
$process->wait;
can be written as
use forks;
sub do_something { my @args = @_; ... }
$process = async { do_something(@args) };
$process->join;
---
use Forks::Super;
my $x = 42;
my @y = ();
my %z = ();
sub do_something_else {
    $x = 19;
    @y = qw(foo bar);
    %z = (foo => 'bar');
}
$process = fork { sub => 'do_something_else', share => [\$x, \@y, \%z] };
$process->wait;
can be written as
use forks;
use forks::shared;
my $x :shared = 42;
my @y :shared = ();
my %z :shared = ();
sub do_something_else {
    $x = 19;
    @y = qw(foo bar);
    %z = (foo => 'bar');
}
$process = async { do_something_else() };
$process->join;
---
use Forks::Super;
use IO::Handle;
pipe my $child_read, my $parent_write;
pipe my $parent_read, my $child_write;
$parent_write->autoflush(1);
$child_write->autoflush(1);
sub square {
    while (my $x = <$child_read>) {
        chomp($x);
        print {$child_write} $x ** 2, "\n";
    }
    close $child_write;
}
$process = fork { sub => 'square' };
print { $parent_write } "9\n";
chomp( my $result = <$parent_read> ); # 81
close $parent_write;
$process->wait;
can be written as
use forks;
use Thread::Queue;

my $req  = Thread::Queue->new();
my $resp = Thread::Queue->new();

sub square { $_[0] ** 2 }

$process = async {
    while (defined(my $x = $req->dequeue())) {
        $resp->enqueue( square($x) );
    }
};

$req->enqueue(9);
my $result = $resp->dequeue();    # 81
$req->enqueue(undef);
$process->join;
---
use Forks::Super;
sub square_root {
    sleep 1 && seek STDIN, 0, 1 while eof(STDIN);    # ok, this is a workaround for an existing bug :-(
    while (my $x = <STDIN>) {
        chomp($x);
        print sqrt($x), "\n";
    }
}
$process = fork { sub => 'square_root', child_fh => 'in,out,block' };
$process->write_stdin("81\n");
chomp( $result = $process->read_stdout() ); # 9
$process->close_fh('stdin');
$process->wait;
can be written as
use forks;
use Thread::Queue;

my $req  = Thread::Queue->new();
my $resp = Thread::Queue->new();

$process = async {
    while (defined(my $x = $req->dequeue())) {
        $resp->enqueue( sqrt($x) );
    }
};

$req->enqueue(81);
my $result = $resp->dequeue();    # 9
$req->enqueue(undef);
$process->join;
I think Forks::Super comes pretty close. It has a few features for running an arbitrary subroutine (or external command) in a background process, monitoring and signalling the background process, and making interprocess communication a little less painful.
use Forks::Super;

sub do_something { my @args = @_; ... }
$process = fork { sub => \&do_something, args => [@args] };
$process->wait;

my $x = 42;
my @y = ();
my %z = ();
sub do_something_else {
    $x = 19;
    @y = qw(foo bar);
    %z = (foo => 'bar');
}
$process = fork { sub => 'do_something_else', share => [\$x, \@y, \%z] };
$process->wait;
# $x, @y, and %z are now updated with changes made in background process

# create your own pipes to use for IPC
use IO::Handle;
pipe my $child_read, my $parent_write;
pipe my $parent_read, my $child_write;
$parent_write->autoflush(1);
$child_write->autoflush(1);
sub square {
    while (my $x = <$child_read>) {
        print {$child_write} $x ** 2, "\n";
    }
    close $child_write;
}
$process = fork { sub => 'square' };
print {$parent_write} "9\n";
my $result = <$parent_read>;    # should be "81\n"
close $parent_write;

# or use the standard I/O handles for IPC
sub square_root {
    sleep 1 && seek STDIN, 0, 1 while eof(STDIN);    # ok, this is a workaround for an existing bug :-(
    while (my $x = <STDIN>) {
        print sqrt($x), "\n";
    }
}
$process = fork { sub => 'square_root', child_fh => 'in,out,block' };
$process->write_stdin("81\n");
$result = $process->read_stdout();    # => "9\n"
Both the multiprocessing module and Forks::Super have a lot of features. Which ones are you specifically interested in?
I am the author of Forks::Super and my goal is to include any features for parallel processing that people find useful, so if there's a feature in multiprocessing that you want in Perl, let me know.
What about POE: Perl Object Environment? It has support for asynchronous child processes.
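A minimal sketch of running an asynchronous child process with POE::Wheel::Run (the event names and the child program are illustrative):
use strict;
use warnings;
use POE qw(Wheel::Run);

POE::Session->create(
    inline_states => {
        _start => sub {
            my ($kernel, $heap) = @_[KERNEL, HEAP];
            my $child = POE::Wheel::Run->new(
                Program     => sub { print "hello from pid $$\n" },
                StdoutEvent => 'on_child_stdout',
                CloseEvent  => 'on_child_close',
            );
            # keep the wheel alive and make sure the child gets reaped
            $heap->{children}{ $child->ID } = $child;
            $kernel->sig_child( $child->PID, 'on_sig_child' );
        },
        on_child_stdout => sub {
            my $line = $_[ARG0];
            print "child said: $line\n";
        },
        on_child_close => sub {
            my ($heap, $wheel_id) = @_[HEAP, ARG0];
            delete $heap->{children}{$wheel_id};
        },
        on_sig_child => sub { },    # registered so the child is reaped
    },
);

POE::Kernel->run();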
You can use https://github.com/marioroy/mce-perl (MCE).
It is similar to Python's multiprocessing module.
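For instance, a minimal MCE::Loop sketch (the worker count and input range are illustrative; gathered results may arrive out of chunk order):
use strict;
use warnings;
use MCE::Loop;

# run four worker processes; MCE handles the chunking of the input list
MCE::Loop::init { max_workers => 4, chunk_size => 'auto' };

# each worker computes squares for its chunk and gathers them back
my @squares = mce_loop {
    my ($mce, $chunk_ref, $chunk_id) = @_;
    MCE->gather( map { $_ * $_ } @{$chunk_ref} );
} 1 .. 20;

print "@squares\n";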
I have a little scraping application and am trying to add multithreading to it. Here is the code (MyMech is a WWW::Mechanize subclass used to handle HTTP errors):
#!/usr/bin/perl
use strict;
use MyMech;
use File::Basename;
use File::Path;
use HTML::Entities;
use threads;
use threads::shared;
use Thread::Queue;
use List::Util qw( max sum );

my $page   = 1;
my %CONFIG = read_config();

my $mech = MyMech->new( autocheck => 1 );
$mech->quiet(0);
$mech->get( $CONFIG{BASE_URL} . "/site-map.php" );

my @championship_links =
    $mech->find_all_links( url_regex => qr/\d{4}-\d{4}\/$/ );

foreach my $championship_link (@championship_links) {
    my @threads;
    my $queue           = Thread::Queue->new;
    my $queue_processed = Thread::Queue->new;

    my $url = sprintf $championship_link->url_abs();
    print $url, "\n";
    next unless $url =~ m{soccer}i;

    $mech->get($url);
    my ( $last_round_loaded, $current_round ) =
        find_current_round( $mech->content() );

    unless ($last_round_loaded) {
        print "\tLoading rounds data...\n";
        $mech->submit_form(
            form_id => "leagueForm",
            fields  => {
                round => $current_round,
            },
        );
    }

    my @match_links =
        $mech->find_all_links( url_regex => qr/matchdetails\.php\?matchid=\d+$/ );
    foreach my $link (@match_links) {
        $queue->enqueue($link);
    }

    print "Starting printing thread...\n";
    my $printing_thread = threads->create(
        sub { printing_thread( scalar(@match_links), $queue_processed ) } )
        ->detach;
    push @threads, $printing_thread;

    print "Starting threads...\n";
    foreach my $thread_id ( 1 .. $CONFIG{NUMBER_OF_THREADS} ) {
        my $thread = threads->create(
            sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
            ->join;
        push @threads, $thread;
    }

    undef $queue;
    undef $queue_processed;

    foreach my $thread ( threads->list() ) {
        if ( $thread->is_running() ) {
            print $thread->tid(), "\n";
        }
    }
    #sleep 5;
}
print "Finished!\n";
sub printing_thread {
    my ( $number_of_matches, $queue_processed ) = @_;
    my @fields = qw(
        championship
        year
        receiving_team
        visiting_team
        score
        average_home
        average_draw
        average_away
        max_home
        max_draw
        max_away
        date
        url
    );
    while ($number_of_matches) {
        if ( my $match = $queue_processed->dequeue_nb ) {
            open my $fh, ">>:encoding(UTF-8)", $CONFIG{RESULT_FILE} or die $!;
            print $fh join( "\t", @{$match}{@fields} ), "\n";
            close $fh;
            $number_of_matches--;
        }
    }
    threads->exit();
}

sub scrape_match {
    my ( $thread_id, $queue, $queue_processed ) = @_;
    while ( my $match_link = $queue->dequeue_nb ) {
        my $url = sprintf $match_link->url_abs();
        print "\t$url", "\n";
        my $mech = MyMech->new( autocheck => 1 );
        $mech->quiet(0);
        $mech->get($url);
        my $match = parse_match( $mech->content() );
        $match->{url} = $url;
        $queue_processed->enqueue($match);
    }
    return 1;
}
And I see some strange behavior with this code. Sometimes it runs, but sometimes it exits with no errors (at the ->detach point). I know that @match_links contains data, but the threads are not created and the program just closes. It usually terminates after processing the second $championship_link entry.
Maybe I'm doing something wrong?
Update
Here is the code for the find_current_round subroutine (but I'm sure it's not related to the question):
sub find_current_round {
    my ($html) = @_;
    my ($select_html) = $html =~ m{
        <select\s+name="round"[^>]+>\s*
        (.+?)
        </select>
    }isx;
    my ( $option_html, $current_round ) = $select_html =~ m{
        (<option\s+value="\d+"(?:\s+ selected="selected")?>(\d+)</option>)\Z
    }isx;
    my ($last_round_loaded) = $option_html =~ m{selected};
    return ( $last_round_loaded, $current_round );
}
First off: don't use dequeue_nb(). This is a bad idea, because if a queue is temporarily empty, it'll return undef and your thread will exit.
Instead, use dequeue and end. dequeue will block, but once you end your queue, the while loop will exit (see the sketch at the end of this answer).
You're also doing some decidedly odd things with your threads - I would suggest that you rarely want to detach a thread. You're just assuming your thread is going to complete before your program exits, which isn't a good plan.
Likewise this:
my $thread = threads->create(
    sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
    ->join;
You're spawning a thread, and then instantly joining it. And so that join call will... block waiting for your thread to exit. You don't need threads at all to do that...
You also scope your queues within your foreach loop. I don't think that's a good plan. I would suggest instead: scope them externally, and spawn a defined number of 'worker' threads (and one 'printing' thread).
And then just feed them through the queue mechanism. Otherwise you'll end up creating multiple queue instances, because they're lexically scoped.
And once you've finished queuing stuff, issue a $queue->end, which'll terminate the while loop.
You also don't need to give a thread a $thread_id, because they already have one. Try threads->self->tid(); instead.
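Putting those suggestions together, a minimal sketch of the restructured worker pool (the queue names echo the question; the worker body and job values are placeholders):
use strict;
use warnings;
use threads;
use Thread::Queue;    # end() requires Thread::Queue 3.01 or later

my $queue           = Thread::Queue->new;
my $queue_processed = Thread::Queue->new;

# A fixed pool of workers: dequeue() blocks until a job arrives and
# returns undef once the queue has been end()ed and drained.
my @workers = map {
    threads->create(sub {
        while (defined(my $link = $queue->dequeue)) {
            $queue_processed->enqueue("scraped: $link");   # stand-in for scrape_match()
        }
    });
} 1 .. 4;

$queue->enqueue($_) for qw(url1 url2 url3);   # feed jobs from the main loop
$queue->end;                                  # blocked workers now see undef and finish

$_->join for @workers;
$queue_processed->end;

while (defined(my $result = $queue_processed->dequeue)) {
    print "$result\n";
}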