I'm using Mojolicious non-blocking methods (Promises) to request data from external systems. 1) I'd like to notify the user immediately that the process has started; 2) I'd like to scale this program.
The code below works for a small set of numbers (a few hundred); with more numbers, I get the error [error] Can't create pipe: Too many open files at /path/lib/perl5/Mojo/IOLoop.pm line 156. Question 1) How can I limit the number of Promises I spawn (the map in my code below)?
#!/usr/bin/env perl
use Mojolicious::Lite;
use Mojolicious::Plugin::TtRenderer;
sub isPrime
{
    my ($n) = @_;
    my $e = sqrt($n);
    for (my $i = 2; $i <= $e; $i++) {
        return 0 if $n % $i == 0;
    }
    return 1;
}
sub makeApromise
{
    my ($number) = @_;
    my $promise = Mojo::Promise->new;
    Mojo::IOLoop->subprocess(
        sub { # first callback is executed in subprocess
            my %response;
            # Simulate a long computational process
            $response{'number'}  = $number;
            $response{'isPrime'} = isPrime($number);
            return \%response;
        },
        sub { # second callback resolves promise with subprocess result
            my ($self, $err, @result) = @_;
            return $promise->reject($err) if $err;
            $promise->resolve(@result);
        },
    );
    return $promise;
}
plugin 'tt_renderer'; # automatically render *.html.tt templates
any '/' => sub {
    my ($self) = @_;
    my $lines = $self->param('textarea');
    if ($lines) {
        my @numbers;
        foreach my $number (split(/\r?\n/, $lines)) {
            push(@numbers, $number) if $number =~ /^\d+$/;
        }
        if (@numbers) {
            ####################################
            ### This is the problem below... ###
            my @promises = map { makeApromise($_) } @numbers;
            ####################################
            # MojoPromise Wait
            Mojo::Promise->all(@promises)
                ->then(sub {
                    my @values = map { $_->[0] } @_;
                    foreach my $response (@values) {
                        #print STDERR $response->{'number'}, " => ", $response->{'isPrime'}, "\n";
                        # Prepare email...
                    }
                    # Send an email...
                })
                #->wait # Don't wait? I want to tell the user to wait for an email as quickly as possible...
                if @promises;
        }
        $self->stash(done => "1");
    }
    $self->render(template => 'index', format => 'html', handler => 'tt');
};
app->start;
__DATA__
## index.html.tt
<!DOCTYPE html>
<html lang="en">
<head>
<title>Make A Promise</title>
</head>
<body>
[% IF done %]
<h3>Thank you! You will receive an email shortly with the results.</h3>
[% ELSE %]
<h3>Enter numbers...</h3>
<form role="form" action="/" method="post">
<textarea name="textarea" rows="5" autofocus required></textarea>
<button type="submit">Submit</button>
</form>
[% END %]
</body>
</html>
I commented out the wait; however, it appears the code is still blocking. Question 2) How can I notify the user immediately that the process has already started? (i.e. when I stash the done variable)
The problem isn't the number of promises but the number of subprocesses. One way to limit them is simply to limit how many you create at a time in your program logic: instead of spawning them all at once in a map, set a limit and retrieve that many numbers from @numbers (perhaps using splice), spawn those subprocesses, create an ->all promise that waits on them, and attach a ->then to that promise that retrieves your next chunk of numbers, and so on.
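For illustration, the splice-based chunking might look like this sketch (plain Perl, with the promise machinery left out; in the real app each chunk would become promises whose ->all's ->then callback grabs the next chunk):

```perl
use strict;
use warnings;

# Pull at most $limit numbers off @numbers per round; here we just
# collect the chunks to show the slicing.
my @numbers = (1 .. 25);
my $limit   = 10;

my @chunks;
while (my @chunk = splice(@numbers, 0, $limit)) {
    push @chunks, [@chunk];
}

printf "%d chunks of sizes: %s\n",
    scalar(@chunks), join(', ', map { scalar @$_ } @chunks);
# prints "3 chunks of sizes: 10, 10, 5"
```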
Another option is to use Future::Utils fmap_concat, which can take care of the rate-limiting code by having you provide a maximum number of outstanding Futures. Your promise-returning function can apply Mojo::Promise::Role::Futurify to chain a following Future to use in this manner.
#!/usr/bin/env perl
use Mojolicious::Lite;
use Mojo::File 'path';
use Mojo::IOLoop;
use Mojo::Promise;
use Future::Utils 'fmap_concat';
get '/' => sub {
    my $c = shift;
    my $count = $c->param('count') // 0;
    my @numbers = 1..$count;
    if (@numbers) {
        my $result_f = fmap_concat {
            my $number = shift;
            my $p = Mojo::Promise->new;
            Mojo::IOLoop->subprocess(sub {
                sleep 2;
                return $number+1;
            }, sub {
                my ($subprocess, $err, @result) = @_;
                return $p->reject($err) if $err;
                $p->resolve(@result);
            });
            return $p->with_roles('Mojo::Promise::Role::Futurify')->futurify;
        } foreach => \@numbers, concurrent => 20;
        $result_f->on_done(sub {
            my @values = @_;
            foreach my $response (@values) {
                $c->app->log->info($response);
            }
        })->on_fail(sub {
            my $error = shift;
            $c->app->log->fatal($error);
        })->retain;
        $c->stash(done => 1);
    }
    $c->render(text => "Processing $count numbers\n");
};
app->start;
As for the wait method, this does nothing when the event loop is already running, which in a webapp response handler it will be, if you started the application in a Mojolicious daemon (as opposed to a PSGI or CGI server which don't support asynchronous responses). The ->stash and ->render calls outside of the callbacks will be run immediately after setting up the subprocesses. Then the response handler will complete, and the event loop will have control again, which will fire the appropriate ->then callbacks once the promises resolve. The render should not be waiting for anything beyond the setting up of subprocesses; since you said there may be hundreds, that could be the slowdown you're experiencing. Make sure you are using Mojolicious 7.86 or newer as Subprocess was changed so the fork will not happen until the next tick of the event loop (after your response handler has completed).
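That handler-first ordering can be demonstrated with a bare event loop (a small sketch, assuming Mojolicious is installed): code scheduled on the loop runs only after the surrounding code hands control back, just as the promise callbacks run only after the response handler returns.

```perl
use strict;
use warnings;
use Mojo::IOLoop;

my @order;

# Scheduled work does not run here...
Mojo::IOLoop->next_tick(sub { push @order, 'callback' });

# ...this line runs first, like the ->stash and ->render in the handler
push @order, 'handler';

Mojo::IOLoop->one_tick;    # hand control to the event loop

print join(' -> ', @order), "\n";    # prints "handler -> callback"
```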
I'll also note that Subprocesses aren't really designed for this; they're designed for executing slow code that still returns an eventual result to the browser in a response (and Mojolicious::Plugin::Subprocess is nice for this use case). One problem I can see is that if you restart the application, any still pending subprocesses will just be ignored. For jobs that you want to set off and forget, you might consider a job queue like Minion which has great integration into Mojolicious apps, and runs via a separate worker process.
Related
I have a threaded application and would like to set timeouts for the threads. The perldoc for alarm suggests using an eval-die pair and catching the ALRM signal. However, this fails with threads, producing the error Alarm clock:
use strict; use warnings;
require threads;
require threads::shared;

my $t = threads->create( sub {
    eval {
        $SIG{ALRM} = sub { die "alarm\n" };
        alarm 2;
        main();
        alarm 0;
    };
    if ($@) {
        die $@ unless $@ eq "alarm\n";
        print "timed out\n";
    }
} );

my @r = $t->join;
print "done\n";

sub main {
    sleep 3;
}
This post suggests that alarm is called without a signal handler in the threads library. Another post is about this problem, and its answers suggest using fork and waitpid, but I would really like to use threads. Another post claims to come up with a solution, but it still gives the Alarm clock error for me. I tried to catch Alarm clock in the if ($@), but with no success. Any idea how I could make this work?
The whole idea of using alarm in threads is problematic.
Signals are sent to processes, not threads.
What if two threads want to use alarm?
You'll have to implement your own system. The following is an attempt at a general solution:
package Threads::Alarm;

use strict;
use warnings;

use threads;
use threads::shared;

use Exporter qw( import );
our @EXPORT_OK = qw( alarm thread_alarm );

# A list of "$time:$tid" strings sorted by ascending time.
my @alarms :shared;

sub thread_alarm {
    my ($wait) = @_;
    my $tid = threads->tid();

    lock @alarms;

    # Cancel existing alarm for this thread, if any.
    for my $i (0..$#alarms) {
        if ((split(/:/, $alarms[$i]))[1] == $tid) {
            splice(@alarms, $i, 1);
            last;
        }
    }

    # Create an alarm
    if ($wait) {
        my $when = time() + $wait;

        # A binary search would be better.
        my $i;
        for ($i=0; $i<@alarms; ++$i) {
            last if $when < (split(/:/, $alarms[$i]))[0];
        }

        splice(@alarms, $i, 0, "$when:$tid");
    }

    # Notify others of change to @alarms.
    cond_broadcast(@alarms);
}

{
    no warnings 'once';
    *alarm = \&thread_alarm;
}

threads->create(sub {
    while (1) {
        my $thread;
        {
            lock @alarms;
            while (1) {
                # Wait for an alarm request to come in.
                cond_wait(@alarms) while !@alarms;

                # Grab the soonest alarm.
                my ($when, $tid) = split(/:/, $alarms[0]);

                # Check if the thread still exists.
                # (No "my" here: the outer $thread must be set,
                # or the kill below would never see it.)
                $thread = threads->object($tid)
                    or last;

                # Wait for @alarms to change or for the alarm time.
                last if !cond_timedwait(@alarms, $when);
            }

            # Before releasing the lock, remove the alarm we're about to raise.
            shift(@alarms);

            # Notify others of change to @alarms.
            # Doesn't actually do anything at this time.
            cond_broadcast(@alarms);
        }

        $thread->kill('ALRM') if $thread;
    }
})->detach();

1;
Completely untested. Well, I made sure it compiles, but that's it.
Note that threads->kill doesn't send a real signal (since those are sent to processes, not threads), so the OS won't interrupt any operation (e.g. sleep, wait). Simple solution: Send a real signal to a handler that does nothing right after calling threads->kill. Maybe I should have written a solution that was based around the actual SIGALRM.
I'm trying to scrape some URLs from a web service. It's working perfectly, but I need to scrape something like 10,000 pages from the same web service.
I do this by creating multiple PhantomJS processes, and each opens and evaluates a different URL (it's the same service; all I change is one parameter in the URL).
The problem is I don't want to open 10,000 pages at once, since I don't want their service to crash, and I don't want my server to crash either.
I'm trying to add some logic that opens/evaluates/inserts-to-DB ~10 pages, and then sleeps for a minute or so.
Let's say this is what I have now:
var numOfRequests = 10000; // Total requests
for (var dataIndex = 0; dataIndex < numOfRequests; dataIndex++) {
phantom.create({'port' : freeport}, function(ph) {
ph.createPage(function(page) {
page.open("http://..." + data[dataIncFirstPage], function(status) {
I want to insert somewhere in the middle something like:
if(dataIndex % 10 == 0){
sleep(60); //I can use the sleep module
}
Everywhere I try to place the sleep call, the program crashes/freezes/loops forever...
Any idea what I should try?
I've tried placing the above code as the first line after the for loop, but this doesn't work (maybe because of the callback functions that are waiting to fire..)
If I place it inside the phantom.create() callback also doesn't work..
Realize that NodeJS runs asynchronously: in your for-loop, each method call is executed one after the other without waiting. The phantom.create call finishes near-immediately, and then the next cycle of the for-loop kicks in.
To answer your question, you want the sleep command at the end of the phantom.create block, still in side the for-loop. Like this:
var numOfRequests = 10000; // Total requests
for( var dataIndex = 0; dataIndex < numOfRequests; dataIndex++ ) {
phantom.create( { 'port' : freeport }, function( ph ) {
// ..whatever in here
} );
if(dataIndex % 10 == 0){
sleep(60); //I can use the sleep module
}
}
Also, consider using a package to help with these control flow issues. Async is a good one, and has a method, eachLimit that will concurrently run a number of processes, up to a limit. Handy! You will need to create an input object array for each iteration you wish to run, like this:
var dataInputs = [ { id: 0, data: "/abc"}, { id : 1, data : "/def"} ];
function processPhantom( dataItem, callback ){
console.log("Starting processing for " + JSON.stringify( dataItem ) );
phantom.create( { 'port' : freeport }, function( ph ) {
// ..whatever in here.
//When done, in inner-most callback, call:
//callback(null); //let the next parallel items into the queue
//or
//callback( new Error("Something went wrong") ); //break the processing
} );
}
async.eachLimit( dataInputs, 10, processPhantom, function( err ){
//Can check for err.
//It is here that everything is finished.
console.log("Finished with async.eachLimit");
});
Sleeping for a minute isn't a bad idea, but in groups of 10, that will take you 1000 minutes, which is over 16 hours! Would be more convenient for you to only call when there is space in your queue - and be sure to log what requests are in process, and have completed.
I've been writing a "checker" system that performs various "checks" on various services, systems, databases, files, etc. A "check" is generic in nature and can be anything. All checks are reported in a common format whether they pass or fail, whatever that may be.
It is written in a modular OO fashion so that developers can simply follow the framework and write checks independently of one another. Each object contains a shared reporting object; after running a check, they simply call $self->{'reporting'}->report(params). The params are defined, and developers are expected to report appropriately. The reporting object then indexes these reports. My main loader script has entries such as the following:
my $reportingObject = new Checks::Reporting(params);
my @checks;

push @checks, new Checks::Check_One($reportingObject, params);
push @checks, new Checks::Check_One($reportingObject, params);
.
.
push @checks, new Checks::Check_N($reportingObject, params);
To kick off the checks and finalize the report once they are done I have been doing:
foreach my $check (@checks) {
    $check->run_stuff();
}
$reportingObject->finalize_report();
Now since these checks are totally independent (do not worry about the reporting object) they can be run in parallel. As an improvement I have done:
my @threads;
foreach my $check (@checks) {
    push @threads, async { $check->run_stuff(); };
}

foreach my $thread (@threads) {
    $thread->join;
}

# All threads are complete, and thus all checks are done
$reportingObject->finalize_report();
As I said earlier the developers will write Checks independently of each other. Some checks are simple and others are not. The simple checks may not have asynchronous code in them, but others might need to run asynchronously internally such as
sub do_check {
    my ($self) = @_;
    my @threads;
    my @list = @{$self->{'list'}};
    foreach my $item (@list) {
        push @threads, async {
            #do_work_on_$item
            #return 1 or 0 for success or fail
        };
    }
    foreach my $thread (@threads) {
        my $res = $thread->join;
        if ($res == 1) {
            $self->{'reporting'}->report(params_here);
        }
    }
}
As you can see the threading model allows me to do things in very vague terms. Each "Check" no matter what it is runs independently in its own thread. If an individual developer has asynchronous stuff to do, no matter what it is, he simply does it independently in its own thread. I want a model similar to this.
Unfortunately threads are slow and inefficient. All of the async libraries have specific watchers such as IO, etc. I do not want anything specific. I would like an event based model that allows me to simply kick off async tasks, no matter what they are, and simply notify when they are all done so I can move on.
Hopefully that explains it and you can point me in the right direction.
This seems like a good fit for a boss-worker model:
Spawn a few workers at the beginning of the program. Make sure they all have access to a queue.
Enqueue as many checks as you like. The workers dequeue the checks, execute them, and enqueue the result in an output queue.
Your main thread looks at the results from the output queue, and does whatever it wants.
Join the workers in an END block
You probably want to look at Thread::Queue::Any if there is a chance you want to put coderefs into the queue.
Here is a fully runnable example:
use strict; use feature 'say';
use threads; use threads::shared; use Thread::Queue::Any;
use constant NUM_THREADS => 5;

local $Storable::Deparse = 1; local $Storable::Eval = 1; # needed to serialize code

my $check_q  = Thread::Queue::Any->new;
my $result_q = Thread::Queue::Any->new;

# start the workers
{
    my $running :shared = NUM_THREADS;
    my @threads = map threads->new(\&worker, $check_q, $result_q, \$running), 1..NUM_THREADS;
    END { $_->join for @threads }
}

# enqueue the checks
$check_q->enqueue($_) for sub {1}, sub {2}, sub {"hi"}, sub { die };
$check_q->enqueue(undef) for 1..NUM_THREADS; # end the queue

while (defined( my $result = $result_q->dequeue )) {
    report($$result);
}

sub report {
    say shift // "FAILED";
}

sub worker {
    my ($in, $out, $running_ref) = @_;
    while (defined( my $check = $in->dequeue )) {
        my $result = eval { $check->() };
        $out->enqueue(\$result);
    }
    # last thread closes the door
    lock $$running_ref;
    --$$running_ref || $out->enqueue(undef);
}
This prints
1
2
hi
FAILED
in a slightly random order.
My Perl script needs to run multiple threads simultaneously...
use threads ('yield', 'exit' => 'threads_only');
use threads::shared;
use strict;
use warnings;
no warnings 'threads';
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Async;
use ...
...and such threads need to obtain some information from web, so HTTP::Async is used.
my $request = HTTP::Request->new;
$request->protocol('HTTP/1.1');
$request->method('GET');
$request->header('User-Agent' => '...');
my $async = HTTP::Async->new( slots => 100,
timeout => REQUEST_TIMEOUT,
max_request_time => REQUEST_TIMEOUT );
But some threads need to access web only when other thread(s) says so.
my $start = [Time::HiRes::gettimeofday()];
my @threads = ();
foreach ... {
    $thread = threads->create(
        sub {
            local $SIG{KILL} = sub { threads->exit };
            my $url = shift;
            if ($url ... ) {
                # wait for "go" signal from other threads
            }
            my ($response, $data);
            $request->url($url);
            $data = '';
            $async->add($request);
            while ($response = $async->wait_for_next_response) {
                threads->yield();
                $data .= $response->as_string;
            }
            if ($data ... ) {
                # send "go" signal to waiting threads
            }
        }, $_);
    if (defined $thread) {
        $thread->detach;
        push(@threads, $thread);
    }
}
There might be one or more threads waiting for "go" signal and there might be one or more threads that such "go" signal can send. At the beginning the status of semaphore is "wait" and once it turns to "go", it will stay so.
Finally, app checks max running time. If threads are running too long, self-termination signal is sent.
my $running;
do {
    $running = 0;
    foreach my $thread (@threads) {
        $running++ if $thread->is_running();
    }
    threads->yield();
} until (($running == 0) ||
         (Time::HiRes::tv_interval($start) > MAX_RUN_TIME));

$running = 0;
foreach my $thread (@threads) {
    if ($thread->is_running()) {
        $thread->kill('KILL');
        $running++;
    }
}
threads->yield();
Now to the point. My questions are:
How can I most effectively code the waiting "semaphore" in the script (see the comments in the script above)? Should I simply use a shared variable with a dummy sleep loop?
Do I need to add a sleep loop at the end of the app to give the threads time to self-destruct?
You might look at Thread::Queue to perform this work. You could setup a queue that would handle the signaling between the threads waiting for the 'go' signal and the threads sending the 'go' signal. Here's a quick mock-up that I haven't tested:
...
use Thread::Queue;
...
# In main body
my $q = Thread::Queue->new();
...
$thread = threads->create(
    sub {
        local $SIG{KILL} = sub { threads->exit };
        my $url = shift;
        if ($url ... ) {
            # wait for "go" signal from other threads
            my $mesg = $q->dequeue();
            # you could put in some termination code if the $mesg isn't 'go'
            if ($mesg ne 'go') { ... }
        }
        ...
        if ($data ... ) {
            # send "go" signal to waiting threads
            $q->enqueue('go');
        }
    }, $_);
...
...
The threads that need to wait for a 'go' signal will wait on the dequeue method until something enters the queue. Once a message enters the queue one thread and only one thread will grab the message and process it.
If you wish to stop the threads so that they won't run, you can insert a stop message to the head of the queue.
$q->insert(0, 'stop') foreach (@threads);
There are examples in Thread::Queue and threads CPAN distributions that show this in more detail.
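Here is a minimal, self-contained sketch of that wait-on-dequeue pattern (the 'go' message and the variable names are illustrative, matching the mock-up above):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $q = Thread::Queue->new();

# The worker blocks in dequeue() until some other thread enqueues a message.
my $worker = threads->create(sub {
    my $mesg = $q->dequeue();
    return $mesg eq 'go' ? 'started' : 'aborted';
});

$q->enqueue('go');    # release exactly one waiting worker

print $worker->join(), "\n";    # prints "started"
```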
In response to your second question, the answer is, unfortunately, it depends. When you proceed to terminate your threads, what kind of clean-up is required for a clean shutdown? What's the worst-case scenario that could occur if the rug were yanked out from beneath a thread? You would want to allow time for that clean-up to occur. The other option is to wait on each thread to actually complete.
The reason for my comment asking if you could remove the detach call is because this method allows the main thread to exit and not care what was happening to any child threads. Instead, if you remove this call, and add:
$_->join() foreach threads->list();
to the end of your main block, this will require the main application to wait for each thread to actually complete.
If you leave the detach method in place, then you will need to sleep at the end of your code if you require your threads to perform any sort of clean-up. When you call detach on a thread, you are telling Perl that you don't care what the thread is doing when your main thread exits. If the main thread exits while detached threads are still running, the program will finish with no warnings. However, if you don't require any clean-up, and you still call detach, feel free to exit whenever you like.
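The difference shows up in a small sketch: a joined thread hands its return value back to the parent, while a detached thread's result is discarded and the parent must otherwise make sure it gets time to finish:

```perl
use strict;
use warnings;
use threads;

# join(): the parent blocks until the thread finishes and collects its result
my $t = threads->create(sub { return 6 * 7 });
print "joined: ", $t->join(), "\n";    # prints "joined: 42"

# detach(): the parent won't wait, and the return value is thrown away
my $d = threads->create(sub { return 'ignored' });
$d->detach();
sleep 1;    # crude stand-in for the clean-up window discussed above
```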
Try out something like this....
#!/usr/bin/perl
use threads;
use threads::shared;

$| = 1;

my ($global) :shared;
my (@threads);

push(@threads, threads->new(\&mySub, 1));
push(@threads, threads->new(\&mySub, 2));
push(@threads, threads->new(\&mySub, 3));

my $i = 0;
foreach my $myThread (@threads)
{
    my @ReturnData = $myThread->join;
    print "Thread $i returned: @ReturnData\n";
    $i++;
}

sub mySub
{
    my ($threadID) = @_;

    for (0..1000)
    {
        $global++;
        print "Thread ID: $threadID >> $_ >> GLB: $global\n";
        sleep(1);
    }
    return ($threadID);
}
In this program, POE::Wheel::FollowTail works well for following the tail of a file; it also runs in a separate thread, simply to monitor the progress of a compile job.
Inside the InputEvent handler there's a crude regex to extract compile results, and there everything is working fine, but I cannot get any result values to be accessible outside this sub. Even if I put result variables in the global scope they are not modified.
The program consists of one process running the compile job, another watching the log, and the main loop waiting.
Global scope:
my $Pass = 0;
my $Done = 0;
Then to kick off the monitoring:
threads->create(\&StartWatcher);
Where the watch-log file sub looks like this:
sub StartWatcher
{
my $logfile = "filename.log";
# Create the logfile watcher
POE::Session->create
(
inline_states =>
{
_start => sub
{
$_[HEAP]{tailor} = POE::Wheel::FollowTail->new( Filename => $logfile, InputEvent => "got_log_line", );
},
got_log_line => sub
{
$Pass += () = $_[ARG0] =~ /^\d+.*vcproj \- 0 error\(s\), \d+ warning\(s\)/g;
$Done += () = $_[ARG0] =~ /^\d+.*vcproj \- \d+ error\(s\), \d+ warning\(s\)/g;
print "POE InputEvent Pass: $Pass, Done: $Done\n"; # Debug output
},
}
);
POE::Kernel->run();
}
The $logfile is being written by a Visual Studio compile job started using Win32::Process::Create and the main Perl execution is sitting in this loop waiting for the compiler to terminate, and producing a status output every second.
while ('true')
{
    $ProcessObj->Wait(100); # milliseconds wait
    $ProcessObj->GetExitCode($exitcode);
    if ( $exitcode == STILL_ACTIVE )
    {
        print "Compiling... [$Done/$Count] Pass: $Pass Failed: $failed\n"
            if ($RunCounter++ % 10 == 0);
        next;
    }
    last;
}
The output produced is similar to this:
POE InputEvent Pass: 1, Done: 1
Compiling... [0/91] Pass: 0 Failed: 0
i.e. in the InputEvent handler got_log_line the two global variables have been incremented, yet in the main Perl loop they are still at zero. I realise that I could print the output from the InputEvent handler, but why doesn't it modify the global variables?
What is going wrong?
Threading in Perl doesn't work the same way as in other languages: the program's memory space is not shared. At thread creation, the current thread is copied into the new one, which is then separate from the parent (each thread has its own instance of the Perl interpreter). If you want to communicate between threads, look at threads::shared, Thread::Queue and Thread::Semaphore.
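A short sketch of that copy-on-create behavior: incrementing a plain variable in a child thread changes only the child's private copy, while a threads::shared variable is visible to both sides.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

my $private        = 0;
my $shared :shared = 0;

threads->create(sub {
    $private++;    # modifies the child's copy only
    $shared++;     # modifies the one shared value
})->join();

print "private: $private, shared: $shared\n";    # prints "private: 0, shared: 1"
```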
Following anydot's suggestion above, here's the answer:
Before starting the thread, create a shared queue
use threads;
use Thread::Queue;
use threads::shared;
my $queue:shared = Thread::Queue->new();
In the input event, create some shared data to enqueue
my %data :shared = ();
$data{PASS_VCPRJ_COUNT} = () = $_[ARG0] =~ /^\d+.*vcproj.*0 error.*\d+ warning/g;
$data{DONE_VCPRJ_COUNT} = () = $_[ARG0] =~ /^\d+.*vcproj.*\d+ error.*\d+ warning/g;
$queue->enqueue(\%data) if ($data{DONE_VCPRJ_COUNT} != 0 || $data{PASS_VCPRJ_COUNT} != 0);
Then in the screen update code, dequeue it, here non-blocking
if (defined(my $item = $queue->dequeue_nb())) {
    foreach my $key (keys %$item) {
        print "$key $item->{$key}\n";
    }
}
There are other ways, I'm sure but this works for me.
Many thanks.