multithreading or forking in perl

In my Perl script I collect a large data set and later need to post it to a server. Up to this point I'm fine, but the post to the server takes a long time, so I need a threading/forking approach so that one task posts to the server while, in parallel, I dig out my second data set.
Code snippet:
if (system("curl -sS $post_url --data-binary \@$filename -H 'Content-type:text/xml;charset=utf-8' 1>/dev/null") != 0)
{
    exit_script(" xml: Error ", "Unable to update $filename xml on $post_url");
}
Can anyone tell me whether this is achievable with threading or forking?

It's difficult to give an answer to your question, because it depends.
Yes, Perl supports both forking and threading.
In general, I would suggest looking at threading for data-oriented tasks, and forking for almost anything else.
And so what you want to do is eminently achievable.
First you need to:
Encapsulate your tasks into subroutines. Get that working first. (This is very important - parallel stuff causes worlds of pain and is difficult to troubleshoot if you're not careful - get it working single-threaded first).
Run your subroutines as threads, and capture their results.
Something like this:
use threads;

sub curl_update {
    my $result = system("your_curl_command");
    return $result;
}

# start the async curl
my $thr = threads->create(\&curl_update);

# do your other stuff....
sleep(60);

my $result = $thr->join();
if ($result) {
    # do whatever you would if the curl update failed
}
In this, join is a blocking call - your main code will stop and wait for your thread to complete. If you want to do something more complicated, you can use is_running or is_joinable, which are non-blocking.
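For example - and this is just a sketch, with do_other_work() standing in for whatever your main loop actually does - you could poll is_joinable() instead of blocking on join:
use threads;

my $thr = threads->create(\&curl_update);

# keep doing other work, and only join once the thread has finished
while (1) {
    do_other_work();              # hypothetical: dig out the next data set
    if ($thr->is_joinable()) {
        my $result = $thr->join();
        warn "curl update failed" if $result;
        last;
    }
}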

I'd suggest neither.
You're just talking lots of HTTP. You can do concurrent HTTP a lot more cleanly, because it's just network IO, by using any of the asynchronous IO systems. Perl has many of them.
Principally I'd suggest IO::Async, but then I wrote it. You can use Net::Async::HTTP to make an HTTP hit. This will fully support doing many of them at once - many hundreds or thousands if need be.
Otherwise, you can also try either POE or AnyEvent, which will both support the same thing in their own way.
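For example, here's a rough sketch along the lines of the Net::Async::HTTP synopsis (untested against your setup; the example @urls is just a placeholder):
use strict;
use warnings;
use IO::Async::Loop;
use Net::Async::HTTP;
use Future;

my @urls = ('http://example.com/one', 'http://example.com/two');

my $loop = IO::Async::Loop->new;
my $http = Net::Async::HTTP->new;
$loop->add($http);

# start every request at once; each ->GET returns a Future immediately
my @futures = map {
    $http->GET($_)->on_done(sub {
        my ($response) = @_;
        print "Got ", $response->code, "\n";
    });
} @urls;

# do other work here if you like, then wait for all requests to finish
Future->wait_all(@futures)->get;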

Related

How to speed up Parallel::ForkManager in perl

I am using an Amazon EC2 server to perform data processing of 63 files.
The server I am using has 16 cores, but when I use Perl's Parallel::ForkManager with the number of processes equal to the number of cores, it seems like half the cores are sleeping and the working cores are not at 100% - they fluctuate around 25%~50%.
I also checked IO and it is mostly idling.
use Sys::Info;
use Sys::Info::Constants qw( :device_cpu );
my $info = Sys::Info->new;
my $cpu  = $info->device( CPU => %options );

use Parallel::ForkManager;
my $manager=new Parallel::ForkManager($cpu->count);
for ($i = 0; $i <= $#files_l; $i++)
{
    $manager->start and next;
    do_stuff($files_l[$i]);
    $manager->finish;
}
$manager->wait_all_children;
The short answer is - we can't tell you, because it depends entirely on what 'do_stuff' is doing.
The major reasons why parallel code doesn't create linear speed increases are:
Process creation overhead - some 'work' is done to spawn a process, so if the children are trivially small, that 'wastes' effort.
Contended resources - the most common is disk IO, but things like file locks, database handles, sockets, or interprocess communication can also play a part.
Something else causing a 'back off' that stalls a process.
And without knowing what 'do_stuff' does, we can't second-guess what it might be.
However I'll suggest a couple of steps:
Double the number of processes to twice the CPU count. That's often a 'sweet spot', because any non-CPU delay in one process just means one of the others gets to go full speed (a one-line sketch follows this list).
Try strace -fTt <yourprogram> (if you're on Linux; the commands are slightly different on other Unix variants). Then do it again with strace -fTtc, because the c will summarise syscall run times. Look at which ones take the most 'time'.
Profile your code to see where the hot spots are. Devel::NYTProf is one library you can use for this.
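To illustrate the first suggestion (just a sketch, reusing the Sys::Info-based $cpu object from your code):
use Parallel::ForkManager;

# oversubscribe: twice as many worker processes as CPU cores, so a worker
# stalled on IO leaves a core free for one of the others
my $manager = Parallel::ForkManager->new( 2 * $cpu->count );
For the profiling step, Devel::NYTProf is typically run as perl -d:NYTProf yourscript.pl, followed by nytprofhtml to generate the report.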
And on a couple of minor points:
my $manager=new Parallel::ForkManager($cpu->count);
would be better written as:
my $manager = Parallel::ForkManager->new( $cpu->count );
rather than using indirect object notation.
If you are just iterating over @files then it might be better to drop the loop counter variable and instead write:
foreach my $file ( @files ) {
    $manager->start and next;
    do_stuff($file);
    $manager->finish;
}

Perl modules to use for parallel processing

I want to parallelize a program written in Perl.
The code loops over multiple files and calls a subroutine for each file.
I also need to share some read-only local data structures with the subroutine.
sub process_in_parallel {
    my $readOnlySchema = foo();
    foreach my $file ( @files ) {
        validate_the_file($file, $readOnlySchema);
    }
}
What Perl modules can the Perl monks recommend for this scenario? I tried some of the following:
threads
The problem with this is managing the threads. Is there an efficient thread manager or thread pool library that can help me with this? I am also not sure whether I can share the read-only object.
Parallel::ForkManager
The problem with this is that it forks processes rather than threads, and it increases the execution time in my case.
I have posted the same question here: http://perlmonks.com/?node_id=1182517
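For reference, this is roughly the threads-based pool I have been experimenting with (a sketch only; the pool size of 4 is arbitrary, and foo(), validate_the_file() and @files are as in the snippet above). The read-only schema is captured by the worker closure before the threads start, so each thread gets its own copy:
use strict;
use warnings;
use threads;
use Thread::Queue;

my $readOnlySchema = foo();       # build once, before any threads are spawned
my $q = Thread::Queue->new();

# a fixed pool of workers; each copy of the closure carries $readOnlySchema
my @workers = map {
    threads->create(sub {
        while (defined(my $file = $q->dequeue())) {
            validate_the_file($file, $readOnlySchema);
        }
    });
} 1 .. 4;

$q->enqueue($_) for @files;
$q->enqueue(undef) for @workers;  # one undef per worker ends its loop
$_->join() for @workers;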

multithreading with MQ

I'm having a problem using the MQSeries Perl module in a multi-threading environment. Here is what I have tried:
Create two handles in different threads with $mqMgr = MQSeries::QueueManager->new(). I thought this would give me two different connections to MQ, but instead I got return code 2219 on the second call to MQOPEN(), which probably means I got the same underlying connection to MQ from two separate calls to the new() method.
Declare only one $mqMgr as a global shared variable. But I can't assign a reference to an MQSeries::QueueManager object to $mqMgr. The reason is "Type of arg 1 to threads::shared::share must be one of [$#%] (not subroutine entry)".
Declare only one $mqMgr as a global variable. Got the same 2219 code.
Tried to pass MQCNO_HANDLE_SHARE_NO_BLOCK into MQSeries::QueueManager->new(), so that a single connection can be shared across threads. But I cannot find a way to pass it in.
My question is, with the Perl MQSeries module:
How (if at all) can I get a separate connection to the MQ queue manager from each thread?
How (if at all) can I share a connection to the MQ queue manager across different threads?
I have looked around but with little luck. Any info would be appreciated.
related question:
C++ - MQ RC Code 2219
Update 1: added an example in which two local MQSeries::QueueManager objects in two threads cause MQ error code 2219.
use threads;
use Thread::Queue;
use MQSeries;
use MQSeries::QueueManager;
use MQSeries::Queue;

# globals
our $jobQ    = Thread::Queue->new();
our $resultQ = Thread::Queue->new();

# ----------------------------------------------------------------------------
# sub routines
# ----------------------------------------------------------------------------
sub worker {
    # fetch work from $jobQ and put result to $resultQ
    # ...
}

sub monitor {
    # fetch result from $resultQ and put it onto another MQ queue
    my $mqQMgr = MQSeries::QueueManager->new( ... );
    # different queue from the one in main
    # this would cause error with MQ code 2219
    my $mqQ = MQSeries::Queue->new( ... );
    while (defined(my $result = $resultQ->dequeue())) {
        # create an mq message and put it into $mqQ
        my $mqMsg = MQSeries::Message->new();
        $mqQ->put($mqMsg);
    }
}

# main
unless (caller()) {
    # create connection to MQ
    my $mqQMgr = MQSeries::QueueManager->new( ... );
    my $mqQ    = MQSeries::Queue->new( ... );

    # create worker and monitor threads
    my @workers;
    for (1 .. $nThreads) {
        push(@workers, threads->create('worker'));
    }
    my $monitor = threads->create('monitor');

    while (1) {
        my $mqMsg = MQSeries::Message->new();
        my $retCode = $mqQ->get(
            Message       => $mqMsg,
            GetMsgOptions => $someOption,
            Wait          => $sometime
        );
        die("error") if ($retCode == 0);
        next if ($retCode == -1); # no message
        # now we have some job to do
        $jobQ->enqueue($mqMsg->Data);
    }
}
There is a very real danger when trying to multithread with modules: the module may not be thread-safe. There are a bunch of things that can just break messily because of the way threading works - you clone the current process state, and that includes things like file handles, sockets, etc.
But if you try and use them in an asynchronous/threaded way, they'll act really weird because the operations aren't (necessarily) atomic.
So whilst I can't answer your question directly, because I have no experience of the particular module:
Unless you know otherwise, assume you can't share between threads. It might be thread-safe, it might not. If it isn't, it might still look OK - until one day you get a horrifically difficult-to-find bug as the result of a race condition.
A shared scalar/list is explicitly described in threads::shared as basically safe (and even then, you can still have problems with non-atomicity if you're not locking).
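For example (a minimal sketch, not your code), threads::shared lets you mark a scalar as shared and lock() it around a read-modify-write so the update stays atomic:
use strict;
use warnings;
use threads;
use threads::shared;

my $count :shared = 0;

my @thr = map {
    threads->create(sub {
        for (1 .. 1000) {
            lock($count);    # without this, ++ is a non-atomic read-modify-write
            $count++;
        }
    });
} 1 .. 4;

$_->join() for @thr;
print "$count\n";            # 4000 with locking; possibly less without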
I would suggest therefore that what you need to do is either:
have a 'comms' thread that does all the work related to the module, and make the other threads use IPC to talk to it. Thread::Queue can work nicely for this (see the sketch after this list).
treat each thread as entirely separate for purposes of the module. That includes loading it (with require and import - not use because that acts earlier) and instantiating. (You might get away with 'loading' the module before threads start, but instantiating does things like creating descriptors, sockets etc.)
lock stuff when there's any danger of interruption of an atomic operation.
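A rough sketch of the first option (the module name and the make_mq_connection()/do_mq_work() helpers are placeholders, not real APIs):
use strict;
use warnings;
use threads;
use Thread::Queue;

my $work_q = Thread::Queue->new();

# the only thread that ever loads or talks to the non-thread-safe module
my $comms = threads->create(sub {
    require Some::NonThreadSafe::Module;   # placeholder module name
    Some::NonThreadSafe::Module->import;
    my $conn = make_mq_connection();       # placeholder helper
    while (defined(my $item = $work_q->dequeue())) {
        do_mq_work($conn, $item);          # placeholder helper
    }
});

# every other thread just enqueues work for the comms thread
$work_q->enqueue($_) for 1 .. 10;
$work_q->enqueue(undef);                   # tell the comms thread to finish
$comms->join();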
Much of the above also applies to fork parallelism - but not in quite the same way, as fork makes "sharing" stuff considerably harder, so you're less likely to trip over it.
Edit:
Looking at the code you've posted, and cross-referencing against the MQSeries source:
There is a BEGIN block that sets up some stuff with MQSeries at the point at which you use it.
Whilst I can't say for sure that this is your problem, it makes me very wary - because when it does that, it sets things up, and then when your threads start they inherit non-shared copies of "whatever it did" during that BEGIN block.
So in light of what I suggested earlier on - I would recommend you try (because I can't say for sure, as I don't have a reference implementation):
require MQSeries;
MQSeries->import;
Put this in your code - in lieu of use - after thread start. E.g. after you do the creates and within the thread subroutine.
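For example, sketched onto the monitor() sub from your code (connection arguments elided as in your post, and untested against MQSeries):
sub monitor {
    # load MQSeries inside the thread, after it has started
    require MQSeries::QueueManager;
    MQSeries::QueueManager->import;
    require MQSeries::Queue;
    MQSeries::Queue->import;

    my $mqQMgr = MQSeries::QueueManager->new( ... );
    my $mqQ    = MQSeries::Queue->new( ... );
    # ... then dequeue results and put them onto the MQ queue as before
}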

Perl parallel HTTP requests - out of memory

First of all, I'm new to Perl.
I want to make multiple (e.g. 160) HTTP GET requests against a REST API in Perl. Executing them one after another takes a lot of time, so I was thinking of running the requests in parallel. Therefore I used threads to execute more requests at the same time and limited the number of parallel requests to 10.
This worked just fine the first time I ran the program; the second time, I got an 'out of memory' error after the 40th request.
Here's the code (@urls contains the 160 URLs for the requests):
while (@urls) {
    my @threads;
    for (my $j = 0; $j < 10 and @urls; $j++) {
        my $url = shift(@urls);
        push @threads, async { $ua->get($url) };
    }
    for my $thread (@threads) {
        my $response = $thread->join;
        print "$response\n";
    }
}
So my question is, why am I NOT running out of memory the first time but the second time (am I missing something crucial in my code)? And what can I do to prevent it?
Or is there a better way of executing parallel GET requests?
I'm not sure why you would get an OOM error on a second run when you don't get one on the first run; when you run a Perl script and the perl binary exits, it releases all of its memory back to the OS. Nothing is kept between executions. Is the exact same data being returned by the REST service each time? Maybe there's more data the second time you run it and it's pushing you over the edge.
One problem I notice is that you're launching 10 threads and running them to completion, then spawning 10 more threads. A better solution may be a worker-thread model. Spawn 10 threads (or however many you want) at the start of the program, put the URLs into a queue, and allow the threads to process the queue themselves. Here's a quick example that may help:
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();
my @thr = map {
    threads->create(sub {
        my @responses = ();
        while (defined (my $url = $q->dequeue())) {
            push @responses, $ua->get($url);
        }
        return @responses;
    });
} 1..10;

$q->enqueue($_) for @urls;
$q->enqueue(undef) for 1..10;

foreach (@thr) {
    my @responses_of_this_thread = $_->join();
    print for @responses_of_this_thread;
}
Note, I haven't tested this to make sure it works. In this example, you create a new thread queue and spawn up 10 worker threads. Each thread will block on the dequeue method until there is something to be read. Next, you queue up all the URLs that you have, and an undef for each thread. The undef will allow the threads to exit when there is no more work to perform. At this point, the threads will go through and process the work, and you will gather the responses via the join at the end.
Whenever I need an asynchronous solution in Perl, I first look at the POE framework. In this particular case I used the POE HTTP Request module, which lets us send multiple requests simultaneously and provides a callback mechanism where you can process your HTTP responses.
Perl threads are scary and can crash your application, especially when you join or detach them. If responses do not take a long time to process, a single-threaded POE solution would work beautifully.
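For example, a bare-bones sketch modelled on the POE::Component::Client::HTTP synopsis (not the code I actually used; the URLs are placeholders):
use strict;
use warnings;
use POE qw(Component::Client::HTTP);
use HTTP::Request;

my @urls = ('http://example.com/one', 'http://example.com/two');

# one UA component services all requests concurrently
POE::Component::Client::HTTP->spawn(Alias => 'ua');

POE::Session->create(
    inline_states => {
        _start => sub {
            my $kernel = $_[KERNEL];
            # post every request at once; responses come back as events
            $kernel->post('ua', 'request', 'got_response',
                          HTTP::Request->new(GET => $_)) for @urls;
        },
        got_response => sub {
            my ($request_packet, $response_packet) = @_[ARG0, ARG1];
            my $response = $response_packet->[0];
            print $response->code, ' ', $request_packet->[0]->uri, "\n";
        },
    },
);

POE::Kernel->run();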
Sometimes, though, we have to rely on threading because the application gets blocked by long-running tasks. In those cases, I create a certain number of threads BEFORE initiating anything in the application. Then with Thread::Queue I pass the data from the main thread to these workers AND never join/detach them; I always keep them around for stability purposes.
(Not an ideal solution for every case.)
POE supports threads now, and each thread can run a POE::Kernel. The kernels can communicate with each other through TCP sockets (for which POE provides nice non-blocking interfaces).

Thread joining problems

I am using Perl on a Linux box and my memory usage keeps going up and up, I believe because of previously run threads that have not been killed/joined.
I think I need to somehow signal the thread(s) that are done to terminate, and then detach
them so that they will get cleaned up automatically, giving me back memory...
I have tried return(); with $thr_List->join(); and $thr_List->detach(); but my GUI doesn't show for ages with join, and the memory problem seems
to still be there with detach... Any help...
$mw->repeat(100, sub { # shared var handler/pivoting-point !?
    while (defined(my $command = $q->dequeue_nb())) { #???
        # to update a statusbar's text
        $text->delete('0.0', "end");
        $text->insert('0.0', $command);
        $indicatorbar->value($val); # to update a ProgressBar's value to $val
        $mw->update();
        for ( @threadids ) { # @threadids is a shared array containing thread ids
            # that is to say I have got the info I wanted from a thread and pushed its id into the above @threadids array.
            print "I want to now kill or join the thread with id: $_\n";
            #$thrWithId->detach();
            #$thrWithId->join();
            # then delete that id from the array
            # delete $threadids[elWithThatIdInIt];
            # as this seems to be in a repeat(100, sub... too, there are problems??!
            # locks maybe?!?
            # for ( @threadids ) if it's not empty?!?
        }
    } # end of while
}); # end of sub

# Some worker... that works with the above handler/pivot me thinks #???
async {
    for (;;) {
        sleep(0.1);
        $q->enqueue($StatusLabel);
    }
}->detach();
I have uploaded my full code here (http://cid-99cdb89630050fff.office.live.com/browse.aspx/.Public) if needed; it's in Boxy.zip...
First sorry for replying here but I've lost my cookie that allows editing etc...
Thanks very much gangabass, that looks like great info, I will have to spend some time on it though, but at least it looks like others are asking the same questions... I was worried I was making a total mess of things.
Thanks guys...
So it sounds like you got the join working but it was very slow?
Threading in Perl is not lightweight. Creating and joining threads takes significant memory and time.
If your task allows, it is much better to keep threads running and give them additional work rather than ending them and starting new threads later. Thread::Queue can help with this. That said, unless you are on Windows, there is not a lot of point to doing that instead of forking and using Parallel::ForkManager.
You need to use only one worker thread and update the GUI from it. Other threads just process data from the queue, and there is no need to terminate them.
See this example for more info
