Perl: handing a module subroutine to threads (multithreading)

I am trying to pass a subroutine from a self-written module to threads using the following code.
This is my first time using threads, so I'm not very familiar with them.
Main Script (shortened)
#!/usr/bin/perl -w
use strict;
use threads;
use lib 'PATH TO LIB';
use goldstandard;

my $delete_raw_files = 0;
my $outfolder = '/PATH/';
my %folder = goldstandard->create_folder($outfolder, $delete_raw_files);

&tagging if $tagging == 1;

sub tagging{
    my %hash = goldstandard->tagging_hash(\%folder);
    my @threads;
    foreach(keys %hash){
        if($_ =~ m/mate/){
            my $arguments = "goldstandard -> mate_tagging($hash{$_}{raw},$hash{$_}{temp},$hash{$_}{tagged},$mate_anna,$mate_model)";
            push(@threads,$arguments);
        }
        if($_ =~ m/morpheus/){
            my $arguments = "goldstandard -> morpheus_tagging($hash{$_}{source},$hash{$_}{tagged},$morpheus_stemlib,$morpheus_cruncher)";
            push(@threads,$arguments)
        }
    }
    foreach(@threads){
        my $thread = threads->create($_);
        $thread->join();
    }
}
Module
package goldstandard;
use strict;
use warnings;

sub mate_tagging{
    my $Referenz = shift;
    my $input = shift;
    my $output_temp_dir = shift;
    my $output_mate_human = shift;
    my $anna = shift;
    my $model = shift;

    opendir(DIR,"$input");
    my @dir = readdir(DIR);
    my $anzahl = @dir;
    foreach(@dir){
        unless($_ =~ m/^\./){
            my $name = $_;
            my $path = $input . $_;
            my $out_temp = $output_temp_dir . $name;
            my $out_mate_human_final = $output_mate_human . $name;
            qx(java -Xmx10G -classpath $anna is2.tag.Tagger -model $model -test $path -out $out_temp);
            open(OUT, "> $out_mate_human_final");
            open(TEMP, "< $out_temp");
            my $output_text;
            while(<TEMP>){
                unless($_ =~ m/^\s+$/){
                    if ($_ =~ m/^\d+\t(.*?)\t_\t_\t_\t(.*?)\t_\t/) {
                        my $tags = $2;
                        my $words = $1;
                        print OUT "$words\t$tags\n";
                    }
                }
            }
        }
    }
}

sub morpheus_tagging{
    my $Referenz = shift;
    my $input = shift;
    my $output = shift;
    my $stemlib = shift;
    my $cruncher = shift;

    opendir(DIR,"$input");
    my @dir = readdir(DIR);
    foreach(@dir){
        unless($_ =~ m/^\./){
            my $name = $_;
            my $path = $input . $_;
            my $out = $output . $name;
            qx(env MORPHLIB='$stemlib' '$cruncher' < '$path' > '$out');
        }
    }
}

1;
Executing this code gets me
Thread 1 terminated abnormally: Undefined subroutine &main::goldstandard -> morpheus_tagging(...) called at ... line 43.
I guess either the way I am calling the threads or the way I am providing the arguments is wrong. I hope someone can help me with that. I also found something on safe and unsafe modules, but I'm not sure if this is really the problem. Thanks in advance.

You must pass the name of a sub or a reference to a sub, plus arguments, to threads->create. So you need something like
my $method_ref = $invoker->can($method_name);
threads->create($method_ref, $invoker, @args);
That said, passing arguments to threads->create has issues that can be avoided by using a closure.
threads->create(sub { $invoker->$method_name(@args) })
The above can be written more simply as follows:
async { $invoker->$method_name(@args) }
This gets us the following:
sub tagging {
    my %hash = goldstandard->tagging_hash(\%folder);

    my @jobs;
    for (keys %hash) {
        if (/mate/) {
            push @jobs, [ 'goldstandard', 'mate_tagging',
                $hash{$_}{raw},
                $hash{$_}{temp},
                $hash{$_}{tagged},
                $mate_anna,
                $mate_model,
            ];
        }
        if (/morpheus/) {
            push @jobs, [ 'goldstandard', 'morpheus_tagging',
                $hash{$_}{source},
                $hash{$_}{tagged},
                $morpheus_stemlib,
                $morpheus_cruncher,
            ];
        }
    }

    my @threads;
    for my $job (@jobs) {
        my ($invoker, $method_name, @args) = @$job;
        push @threads, async { $invoker->$method_name(@args) };
    }

    $_->join for @threads;
}
or just
sub tagging {
    my %hash = goldstandard->tagging_hash(\%folder);

    my @threads;
    for (keys %hash) {
        if (/mate/) {
            push @threads, async {
                goldstandard->mate_tagging(
                    $hash{$_}{raw},
                    $hash{$_}{temp},
                    $hash{$_}{tagged},
                    $mate_anna,
                    $mate_model,
                );
            };
        }
        if (/morpheus/) {
            push @threads, async {
                goldstandard->morpheus_tagging(
                    $hash{$_}{source},
                    $hash{$_}{tagged},
                    $morpheus_stemlib,
                    $morpheus_cruncher,
                );
            };
        }
    }

    $_->join for @threads;
}
Note that I delayed the calls to join until after all the threads are created. Your way made it so only one thread would run at a time.
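As a small standalone illustration of the difference (the three jobs and the one-second sleeps are just stand-ins for real work):

use threads;

# Joining each thread right after creating it serialises the work:
for my $n (1 .. 3) {
    threads->create(sub { sleep 1; print "job $n done\n" })->join();   # ~3 seconds in total
}

# Creating all the threads first and joining afterwards lets them overlap:
my @workers = map {
    my $n = $_;
    threads->create(sub { sleep 1; print "job $n done\n" });
} 1 .. 3;
$_->join() for @workers;                                               # ~1 second in total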
But what we have isn't great. We have no way of limiting how many threads are active at a time, and we (expensively) create many threads instead of reusing them. We can use a worker pool to solve both of these problems. Since code references can't be shared between threads, the jobs are passed through the queue as plain data (class name, method name and arguments) that the workers turn back into method calls.
use constant NUM_WORKERS => 5;

use Thread::Queue 3.01 qw( );

my $q;

sub tagging {
    my %hash = goldstandard->tagging_hash(\%folder);
    for (keys %hash) {
        if (/mate/) {
            $q->enqueue([ 'goldstandard', 'mate_tagging',
                $hash{$_}{raw},
                $hash{$_}{temp},
                $hash{$_}{tagged},
                $mate_anna,
                $mate_model,
            ]);
        }
        if (/morpheus/) {
            $q->enqueue([ 'goldstandard', 'morpheus_tagging',
                $hash{$_}{source},
                $hash{$_}{tagged},
                $morpheus_stemlib,
                $morpheus_cruncher,
            ]);
        }
    }
}

{
    $q = Thread::Queue->new();

    for (1..NUM_WORKERS) {
        async {
            while ( my $job = $q->dequeue() ) {
                my ($invoker, $method_name, @args) = @$job;
                $invoker->$method_name(@args);
            }
        };
    }

    ... call tagging and whatever ...

    $q->end();
    $_->join() for threads->list();
}

Related

Perl: struct-like storage of data

I have an existing Perl script which I have to modify. For this, I need some struct-like container for my data. I do not have any 'outside' modules, nor can I obtain them, and my Perl is 5.8.8. I've written a package containing one scalar and two arrays for my needs, but I cannot get it to work and I am not sure why. Here it is:
{
    package TestData;

    sub new
    {
        my $class = shift;
        my $self = {
            _id          => shift,
            _genUsers    => [],
            _testSymbols => [],
        };
        return bless ($self, $class);
    }

    sub setId
    {
        my ($self, $id) = @_;
        $self->{_id} = $id if defined($id);
    }

    sub addGenUser
    {
        my ($self, $user) = @_;
        push @{$self->{_genUsers}}, $user;
    }

    sub addTestSymbol
    {
        my ($self, $sym) = @_;
        push @{$self->{_testSymbols}}, $sym;
    }

    sub getId
    {
        my $self = @_;
        return $self->{_id};
    }

    sub getGenUserList
    {
        my $self = @_;
        return @{$self->{_genUsers}};
    }

    sub getTestSymbolList
    {
        my $self = @_;
        return @{$self->{_testSymbols}};
    }
}
use strict;
use POSIX;

my $id = "test";
my @a;
push @a, "this";
my @b;
push @b, "that";

my $obj = new TestData($id, @a, @b);
print "DEBUG: " . $obj->getId() . "\n";
Last line always throws "use of uninitialized value". What's going on?
Also would it be possible to do something like this:
my @c;
push @c, $obj;
foreach(@c){
    print "DEBUG2: " . $_->getId() . "\n";
}
Thank you.
EDIT: Thank you everyone for your replies. This is what the fully working end result looked like:
{
    use strict;
    use POSIX;

    package TestData;

    sub new
    {
        my $class = shift;
        my $self = {
            _alpha => shift,
            _beta  => shift,
            _gamma => shift,
            _delta => shift,
            _theta => shift
        };
        return bless ($self, $class);
    }

    sub getAlpha
    {
        my $self = shift;
        return $self->{_alpha};
    }

    sub getBeta
    {
        my $self = shift;
        return $self->{_beta};
    }

    sub getGamma
    {
        my $self = shift;
        return $self->{_gamma};
    }

    sub getDelta
    {
        my $self = shift;
        return $self->{_delta};
    }

    sub getTheta
    {
        my $self = shift;
        return $self->{_theta};
    }
}
The bug is in the line(s) that read:
my $self = @_;
This is a scalar assignment, and so assigns the length of @_ to $self. Trying to use that length as a hash reference then gives the "use of uninitialized value" warning you see, and returns undef.
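A tiny demonstration of the difference between the assignments (the values are made up for the example):

sub demo {
    my $count   = @_;      # scalar context: the number of arguments, here 2
    my ($first) = @_;      # list context: the first argument
    my $shifted = shift;   # also the first argument (and removes it from @_)
    print "$count $first $shifted\n";
}
demo('foo', 'bar');        # prints "2 foo foo"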
By the way, in addition to enabling warnings, you should get into the habit of starting all your scripts and modules with use strict;. If you'd done that, it would've caught this bug, and you would've received a runtime error saying something like:
Can't use string ("1") as a HASH ref while "strict refs" in use at foo.pl line 36.
Anyway, to fix your code, you should replace the line above with either:
my ($self) = @_;
or:
my $self = shift;
or even:
my $self = $_[0];
With my ($self) = @_ the parentheses turn it into a list assignment; with my $self = shift it's still a scalar assignment, but shift (which, inside a sub, is shorthand for shift @_) pulls the first scalar out of @_.
The choice of which style to prefer basically comes down to personal taste and consistency; if you're used to pulling all your method args out of @_ with a single assignment like:
my ($self, $foo, $bar, $baz) = @_;
then, for methods that take no arguments other than $self, a single-element list assignment seems more consistent. On the other hand, if you prefer to pick your args out of @_ one at a time, as in:
my $self = shift;
my $foo = shift;
my $bar = shift;
my $baz = shift;
then you obviously should follow that pattern for single-arg methods too.
Change
my $self = @_;
to
my $self = shift;
in your methods that only expect one argument, and specifically in the one being called (i.e.):
sub getId
{
my $self = shift;
return $self->{_id};
}
You can also wrap $self in parentheses to change the context and achieve the same thing, as noted in the comments.
Output:
DEBUG: test
Rather than roll your own, you might want to check out Class::Struct, which has been part of Perl's core since 5.004.
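For example, a rough Class::Struct equivalent of the TestData package above might look like this (an untested sketch, with field names taken from the question):

use Class::Struct;

struct( TestData => {
    id          => '$',   # scalar field
    genUsers    => '@',   # array field
    testSymbols => '@',   # array field
});

my $obj = TestData->new( id => 'test' );
push @{ $obj->genUsers }, 'this';

print $obj->id, "\n";           # "test"
print $obj->genUsers(0), "\n";  # "this" (element accessor)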

Perl: How to push a hash into an array that is outside of a subroutine

I originally experimented with trying to send a hash object through Thread::Queue, but according to this link, my versions of Thread::Queue and threads::shared are too old. Unfortunately, since the system I'm testing on isn't mine, I can't upgrade.
I then tried to use a common array to store my hashes. Here is the code so far:
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use Thread::Queue;

use constant NUM_WORKERS => 10;

my @out_array;

test1();

sub test1
{
    my $in_queue = Thread::Queue->new();

    foreach (1..NUM_WORKERS) {
        async {
            while (my $job = $in_queue->dequeue()) {
                test2($job);
            }
        };
    }

    my @sentiments = ("Axe Murderer", "Mauler", "Babyface", "Dragon");

    $in_queue->enqueue(@sentiments);
    $in_queue->enqueue(undef) for 1..NUM_WORKERS;

    $_->join() for threads->list();

    foreach my $element (@out_array) {
        print "element: $element\n";
    }
}

sub test2
{
    my $string = $_[0];
    my %hash = (Skeleton => $string);
    push @out_array, \%hash;
}
However, at the end of the procedure, @out_array is always empty. If I remove the threading parts of the script, then @out_array is correctly populated. I suspect I'm implementing threading incorrectly here.
How would I correctly populate @out_array in this instance?
You need to make it shared
use threads::shared;
my @out_array :shared;
I don't think you need to lock it if all you do is push onto it, but if you did, you'd use
lock @out_array;
You need to share any array or hash referenced by a value you push onto it, using the tools in threads::shared.
push @out_array, share(%hash);
Though as I mentioned earlier, I'd use a Thread::Queue.
sub test2 {
    my ($string) = @_;
    my %hash = ( Skeleton => $string );
    return \%hash;
}

...

my $response_q = Thread::Queue->new();
my $running :shared = NUM_WORKERS;

...

async {
    while (my $job = $request_q->dequeue()) {
        $response_q->enqueue(test2($job));
    }
    { lock $running; $response_q->enqueue(undef) if !--$running; }
};

...

$request_q->enqueue(@sentiments);
$request_q->enqueue(undef) for 1..NUM_WORKERS;

while (my $response = $response_q->dequeue()) {
    print "Skeleton: $response->{Skeleton}\n";
}

$_->join() for threads->list();
Note the lack of anything thread-specific in test2. This is good. You should always strive for separation of concerns.
You need to return your data from the thread:
....
async {
    my $data;
    while (my $job = $in_queue->dequeue()) {
        $data = test2($job);
    }
    return $data;
};
...
for ( threads->list() ) {
    my $data = $_->join();
    # now you have this thread's return value in $data
}

sub test2
{
    my $string = $_[0];
    my %hash = (Skeleton => $string);
    return \%hash;
}
I found my answer in the example here.
I had to change 3 things:
share the @out_array outside both subs
share the %hash in test2
add return; to the end of test2
Code outside both subs:
my @out_array :shared = ();
test2 sub:
sub test2
{
    my $string = $_[0];
    my %hash :shared;
    $hash{Skeleton} = $string;
    push @out_array, \%hash;
    return;
}

thread shared perl

I wrote some code and I need to make it multithreaded. Everything works, but every loop repeats 4 times:
use LWP::UserAgent;
use HTTP::Cookies;
use threads;
use threads::shared;

$| = 1;
$threads = 4;

my @groups :shared = loadf('groups.txt');

my @thread_list = ();
$thread_list[$_] = threads->create(\&thread) for 0 .. $threads - 1;
$_->join for @thread_list;

thread();

sub thread
{
    my $url = 'http://www.site.ru/';
    my $response = $web->post($url, Content =>
        ['st.redirect' => ''
        ]);
    foreach $i (@groups)
    {
        my $response = $web->get($i);
        if(!($response->header('Location')))
        {
            ---------;
        }
        else
        {
            ----------;
        }
    }
}

sub loadf {
    open (F, "<".$_[0]) or erroropen($_[0]);
    chomp(my @data = <F>);
    close F;
    return @data;
}
groups.txt :
http://www.odnoklassniki.ru/group/47357692739634
http://www.odnoklassniki.ru/group/56099517562922
I understand that I need to use threads::shared, but I can't understand how to use it.
The problem is that you never remove from @groups, so all threads do all jobs in @groups.
Here's one solution.
use threads;
use Thread::Queue 3.01 qw( );

my $NUM_WORKERS = 4;

sub worker {
    my ($url) = @_;
    ... download the page ...
}

my $q = Thread::Queue->new();

for (1..$NUM_WORKERS) {
    async {
        while (my $url = $q->dequeue()) {
            worker($url);
        }
    };
}

$q->enqueue($_) for loadf('groups.txt');
$q->end();

$_->join() for threads->list;
Why do you need to make it threaded? Perl does much better using forks in most cases.
That said, your code starts 4 threads, each of which processes everything in @groups. It sounds like that's not what you want to do. If you want @groups to be a queue of work to do, take a look at Thread::Queue (or Parallel::ForkManager, as sketched below).
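For what it's worth, a minimal Parallel::ForkManager version of that loop could look roughly like this (a sketch only; the actual response handling is reduced to a comment, and loadf is the sub from the question):

use strict;
use warnings;
use LWP::UserAgent;
use Parallel::ForkManager;

my @groups = loadf('groups.txt');            # loadf as defined in the question
my $pm     = Parallel::ForkManager->new(4);  # at most 4 child processes at a time

for my $url (@groups) {
    $pm->start and next;                     # parent: move on to the next URL
    my $web      = LWP::UserAgent->new();
    my $response = $web->get($url);          # child: do the actual work
    # ... inspect $response->header('Location') here ...
    $pm->finish;                             # child exits
}
$pm->wait_all_children;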

Is there a multiprocessing module for Perl?

Is there a multiprocessing module for Perl? Something that has similar functionality to what's offered by Python's multiprocessing module.
I understand I could build similar functionality using Perl, but I'm looking for something already implemented.
forks provides the same awesome interface as threads, but uses processes instead of threads.
use forks; # Or: use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

my @workers;
for (1..NUM_WORKERS) {
    push @workers, async {
        while (defined(my $job = $q->dequeue())) {
            ...
        }
    };
}

$q->enqueue(...);
$q->enqueue(undef) for @workers;

$_->join for @workers;
Comparing forks with Forks::Super.
Keep in mind, these are supposed to be the cases where Forks::Super excels!
use Forks::Super;
sub do_something { my @args = @_; ... }
$process = fork { sub => \&do_something, args => [@args] };
$process->wait;
can be written as
use forks;
sub do_something { my @args = @_; ... }
$process = async { do_something(@args) };
$process->join;
---
use Forks::Super;
my $x = 42;
my @y = ();
my %z = ();
sub do_something_else {
    $x = 19;
    @y = qw(foo bar);
    %z = (foo => 'bar');
}
$process = fork { sub => 'do_something_else', share => [ \$x, \@y, \%z ] };
$process->wait;
can be written as
use forks;
use forks::shared;
my $x :shared = 42;
my @y :shared = ();
my %z :shared = ();
sub do_something_else {
    $x = 19;
    @y = qw(foo bar);
    %z = (foo => 'bar');
}
$process = async { do_something_else() };
$process->join;
---
use Forks::Super;
use IO::Handle;
pipe my $child_read, my $parent_write;
pipe my $parent_read, my $child_write;
$parent_write->autoflush(1);
$child_write->autoflush(1);
sub square {
    while (my $x = <$child_read>) {
        chomp($x);
        print {$child_write} $x ** 2, "\n";
    }
    close $child_write;
}
$process = fork { sub => 'square' };
print { $parent_write } "9\n";
chomp( my $result = <$parent_read> );  # 81
close $parent_write;
$process->wait;
can be written as
use forks;
use Thread::Queue;
my $req  = Thread::Queue->new();
my $resp = Thread::Queue->new();
sub square { $_[0] ** 2 }
$process = async {
    while (defined(my $x = $req->dequeue())) {
        $resp->enqueue( square($x) );
    }
};
$req->enqueue(9);
my $result = $resp->dequeue();  # 81
$req->enqueue(undef);
$process->join;
---
use Forks::Super;
sub square_root {
    sleep 1 && seek STDIN,0,1 while eof(STDIN);  # ok, this is a workaround for an existing bug :-(
    while (my $x = <STDIN>) {
        chomp($x);
        print sqrt($x), "\n";
    }
}
$process = fork { sub => 'square_root', child_fh => 'in,out,block' };
$process->write_stdin("81\n");
chomp( $result = $process->read_stdout() );  # 9
$process->close_fh('stdin');
$process->wait;
can be written as
use forks;
use Thread::Queue;
my $req  = Thread::Queue->new();
my $resp = Thread::Queue->new();
$process = async {
    while (defined(my $x = $req->dequeue())) {
        $resp->enqueue( sqrt($x) );
    }
};
$req->enqueue(81);
my $result = $resp->dequeue();  # 9
$req->enqueue(undef);
$process->join;
I think Forks::Super comes pretty close. It has a few features for running an arbitrary subroutine (or external command) in a background process, monitoring and signalling the background process, and making interprocess communication a little less painful.
use Forks::Super;

sub do_something { my @args = @_; ... }
$process = fork { sub => \&do_something, args => [@args] };
$process->wait;

my $x = 42;
my @y = ();
my %z = ();
sub do_something_else {
    $x = 19;
    @y = qw(foo bar);
    %z = (foo => 'bar');
}
$process = fork { sub => 'do_something_else', share => [ \$x, \@y, \%z ] };
$process->wait;
# $x, @y, and %z are now updated with changes made in background process

# create your own pipes to use for IPC
use IO::Handle;
pipe my $child_read, my $parent_write;
pipe my $parent_read, my $child_write;
$parent_write->autoflush(1);
$child_write->autoflush(1);
sub square {
    while (my $x = <$child_read>) {
        print {$child_write} $x ** 2, "\n";
    }
    close $child_write;
}
$process = fork { sub => 'square' };
print {$parent_write} "9\n";
my $result = <$parent_read>;  # should be "81\n";
close $parent_write;

# or use the standard I/O handles for IPC
sub square_root {
    sleep 1 && seek STDIN,0,1 while eof(STDIN);  # ok, this is a workaround for an existing bug :-(
    while (my $x = <STDIN>) {
        print sqrt($x), "\n";
    }
}
$process = fork { sub => 'square_root', child_fh => 'in,out,block' };
$process->write_stdin("81\n");
$result = $process->read_stdout();  # => "9\n"
Both the multiprocessing module and Forks::Super have a lot of features. Which ones are you specifically interested in?
I am the author of Forks::Super and my goal is to include any features for parallel processing that people find useful, so if there's a feature in multiprocessing that you want in Perl, let me know.
What about POE: Perl Object Environment? It has support for asynchronous child processes.
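For reference, the basic shape of a POE::Wheel::Run session that runs one child process and reads its output is roughly this (a sketch modelled on the general pattern in the module's documentation, not a tested program):

use strict;
use warnings;
use POE qw(Wheel::Run);

POE::Session->create(
    inline_states => {
        _start => sub {
            my ($kernel, $heap) = @_[KERNEL, HEAP];
            $heap->{child} = POE::Wheel::Run->new(
                Program     => sub { print "hello from the child\n" },
                StdoutEvent => 'child_stdout',
            );
            # Reap the child so the session can shut down cleanly.
            $kernel->sig_child( $heap->{child}->PID, 'child_done' );
        },
        child_stdout => sub { print "child said: $_[ARG0]\n" },
        child_done   => sub { delete $_[HEAP]{child} },
    },
);

POE::Kernel->run();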
You can use https://github.com/marioroy/mce-perl (MCE, the Many-Core Engine for Perl).
It is similar to Python's multiprocessing module.
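For example, MCE::Map gives you a parallel drop-in replacement for map that spreads the work across worker processes (a trivial made-up example):

use strict;
use warnings;
use MCE::Map;

# mce_map behaves like Perl's map, but the input list is processed
# by a pool of worker processes; results come back in the original order.
my @squares = mce_map { $_ * $_ } 1 .. 10;
print "@squares\n";   # 1 4 9 16 25 36 49 64 81 100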

Threads application terminates unexpectedly

I have a little scraping application and am trying to add multithreading to it. Here is the code (MyMech is a WWW::Mechanize subclass used to handle HTTP errors):
#!/usr/bin/perl
use strict;

use MyMech;
use File::Basename;
use File::Path;
use HTML::Entities;
use threads;
use threads::shared;
use Thread::Queue;
use List::Util qw( max sum );

my $page = 1;

my %CONFIG = read_config();

my $mech = MyMech->new( autocheck => 1 );
$mech->quiet(0);
$mech->get( $CONFIG{BASE_URL} . "/site-map.php" );

my @championship_links =
  $mech->find_all_links( url_regex => qr/\d{4}-\d{4}\/$/ );

foreach my $championship_link (@championship_links) {
    my @threads;
    my $queue           = Thread::Queue->new;
    my $queue_processed = Thread::Queue->new;

    my $url = sprintf $championship_link->url_abs();
    print $url, "\n";

    next unless $url =~ m{soccer}i;

    $mech->get($url);

    my ( $last_round_loaded, $current_round ) =
      find_current_round( $mech->content() );

    unless ($last_round_loaded) {
        print "\tLoading rounds data...\n";
        $mech->submit_form(
            form_id => "leagueForm",
            fields  => {
                round => $current_round,
            },
        );
    }

    my @match_links =
      $mech->find_all_links( url_regex => qr/matchdetails\.php\?matchid=\d+$/ );

    foreach my $link (@match_links) {
        $queue->enqueue($link);
    }

    print "Starting printing thread...\n";
    my $printing_thread = threads->create(
        sub { printing_thread( scalar(@match_links), $queue_processed ) } )
      ->detach;
    push @threads, $printing_thread;

    print "Starting threads...\n";
    foreach my $thread_id ( 1 .. $CONFIG{NUMBER_OF_THREADS} ) {
        my $thread = threads->create(
            sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
          ->join;
        push @threads, $thread;
    }

    undef $queue;
    undef $queue_processed;

    foreach my $thread ( threads->list() ) {
        if ( $thread->is_running() ) {
            print $thread->tid(), "\n";
        }
    }

    #sleep 5;
}

print "Finished!\n";

sub printing_thread {
    my ( $number_of_matches, $queue_processed ) = @_;

    my @fields = qw(
        championship
        year
        receiving_team
        visiting_team
        score
        average_home
        average_draw
        average_away
        max_home
        max_draw
        max_away
        date
        url
    );

    while ($number_of_matches) {
        if ( my $match = $queue_processed->dequeue_nb ) {
            open my $fh, ">>:encoding(UTF-8)", $CONFIG{RESULT_FILE} or die $!;
            print $fh join( "\t", @{$match}{@fields} ), "\n";
            close $fh;
            $number_of_matches--;
        }
    }

    threads->exit();
}

sub scrape_match {
    my ( $thread_id, $queue, $queue_processed ) = @_;

    while ( my $match_link = $queue->dequeue_nb ) {
        my $url = sprintf $match_link->url_abs();
        print "\t$url", "\n";

        my $mech = MyMech->new( autocheck => 1 );
        $mech->quiet(0);
        $mech->get($url);

        my $match = parse_match( $mech->content() );
        $match->{url} = $url;

        $queue_processed->enqueue($match);
    }

    return 1;
}
And I see some strange behaviour with this code. Sometimes it runs, but sometimes it exits with no errors (at the ->detach point). I know that @match_links contains data, but the threads are not created and the program just closes. Usually it terminates after processing the second $championship_link entry.
Maybe I'm doing something wrong?
Update
Here is the code for the find_current_round subroutine (but I'm sure it's not related to the question):
sub find_current_round {
    my ($html) = @_;

    my ($select_html) = $html =~ m{
        <select\s+name="round"[^>]+>\s*
        (.+?)
        </select>
    }isx;

    my ( $option_html, $current_round ) = $select_html =~ m{
        (<option\s+value="\d+"(?:\s+ selected="selected")?>(\d+)</option>)\Z
    }isx;

    my ($last_round_loaded) = $option_html =~ m{selected};

    return ( $last_round_loaded, $current_round );
}
First off - don't use dequeue_nb(). This is a bad idea, because if a queue is temporarily empty, it'll return undef and your thread will exit.
Use dequeue and end instead. dequeue will block, but once you end your queue, the while loop will exit.
You're also doing some decidedly odd things with your threads - I would suggest that you rarely want to detach a thread. You're just assuming your thread is going to complete before your program does, which isn't a good plan.
Likewise this:
my $thread = threads->create(
    sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
  ->join;
You're spawning a thread, and then instantly joining it. And so that join call will... block waiting for your thread to exit. You don't need threads at all to do that...
You also scope your queues within your foreach loop. I don't think that's a good plan. I would suggest instead - scope them externally, and spawn a defined number of 'worker' threads (and one 'printing' thread).
And then just feed them through the queue mechanism. Otherwise you'll end up creating multiple queue instances, because they're lexically scoped.
And once you've finished queuing stuff, issue a $queue->end which'll terminate the while loop.
You also don't need to give a thread a $thread_id because ... they already have one. Try: threads->self->tid(); instead.
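Putting those points together, the skeleton might end up looking something like this (a sketch reusing the names from your script, with the scraping details left out):

use threads;
use Thread::Queue;

my $queue           = Thread::Queue->new();
my $queue_processed = Thread::Queue->new();

# One pool of workers for the whole run, created once, outside the championship loop.
my @workers = map {
    threads->create(sub {
        while ( my $match_link = $queue->dequeue() ) {   # blocks; returns undef after end()
            # ... fetch and parse the match here, then:
            # $queue_processed->enqueue($match);
        }
    });
} 1 .. 4;                                                # e.g. $CONFIG{NUMBER_OF_THREADS}

foreach my $championship_link (@championship_links) {
    # ... collect @match_links for this championship, then only enqueue work here:
    # $queue->enqueue($_) for @match_links;
}

$queue->end();                                           # unblocks the workers' dequeue()
$_->join() for @workers;

$queue_processed->end();
# ... drain $queue_processed in the main thread and print the results ...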
