Hi, I have a simple script that takes a file and runs another Perl script on it. The script does this for every picture file in the current folder. It runs on a machine with 2 quad-core Xeon processors and 16 GB of RAM, running Red Hat Linux.
The first script, work.pl, basically calls magicplate.pl, passing some parameters and the name of the file for magicplate.pl to process. Magic Plate takes about a minute to process each image. Because work.pl performs the same function over 100 times, and because the system has multiple processors and cores, I was thinking about splitting the task up so that it could run multiple times in parallel. I could split the images up into different folders if necessary. Any help would be great. Thank you.
Here is what I have so far:
use strict;
use warnings;

my @initialImages = <*>;

foreach my $file (@initialImages) {
    if ($file =~ /\.png$/) {
        print "processing $file...\n";
        my @tmp  = split(/\./, $file);
        my $name = "";
        for (my $i = 0; $i < (@tmp - 1); $i++) {
            if ($name eq "") { $name = $tmp[$i]; } else { $name = $name . "." . $tmp[$i]; }
        }
        my $exten = $tmp[@tmp - 1];
        my $orig  = $name . "." . $exten;
        system("perl magicPlate.pl -i " . $orig . " -min 4 -max 160 -d 1");
    }
}
You should consider NOT creating a new process for each file that you want to process -- it's horribly inefficient, and probably what is taking most of your time here. Just loading up Perl and whatever modules you use, over and over, creates a fair amount of overhead. I recall a poster on PerlMonks who did something similar and ended up turning his second script into a module, reducing the work time from an hour to a couple of minutes. Not that you should expect such a dramatic improvement, but one can dream...
With the second script refactored as a module, here's an example of thread usage, in which BrowserUK creates a thread pool, feeding it jobs through a queue.
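The general shape of such a pool, as a minimal sketch (assuming magicPlate.pl has been refactored into a hypothetical MagicPlate module exposing a process() function), might look like this:
use strict;
use warnings;
use threads;
use Thread::Queue;
use MagicPlate;    # hypothetical module factored out of magicPlate.pl

my $workers = 8;    # roughly one worker per core
my $queue   = Thread::Queue->new();

# Each worker pulls filenames off the queue until the queue is ended.
my @pool = map {
    threads->create(sub {
        while (defined(my $file = $queue->dequeue)) {
            # Hypothetical API; use whatever the refactored module actually exposes.
            MagicPlate::process($file, min => 4, max => 160, d => 1);
        }
    });
} 1 .. $workers;

$queue->enqueue(grep { /\.png$/ } <*>);
$queue->end;
$_->join for @pool;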
You could use Parallel::ForkManager (set $MAX_PROCESSES to the number of files processed at the same time):
use strict;
use warnings;
use Parallel::ForkManager;

my $MAX_PROCESSES = 4;    # number of files processed at the same time
my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

my @initialImages = <*>;

foreach my $file (@initialImages) {
    if ($file =~ /\.png$/) {
        print "processing $file...\n";
        my @tmp  = split(/\./, $file);
        my $name = "";
        for (my $i = 0; $i < (@tmp - 1); $i++) {
            if ($name eq "") { $name = $tmp[$i]; } else { $name = $name . "." . $tmp[$i]; }
        }
        my $exten = $tmp[@tmp - 1];
        my $orig  = $name . "." . $exten;

        my $pid = $pm->start and next;
        system("perl magicPlate.pl -i " . $orig . " -min 4 -max 160 -d 1");
        $pm->finish;    # terminates the child process
    }
}
$pm->wait_all_children;
But, as Hugmeir suggested, starting the Perl interpreter again and again for each new file is not a good idea.
Import "maigcplate" and use threading.
Start magicplate.pl in the background (you would need to add process throttling)
Import "magicplate" and use fork (add process throttling and a kiddy reaper)
Make "maigcplate" a daemon with a pool of workers = # of CPUs
use an MQ implementation for communication
use sockets for communication
Use webserver(nginx, apache, ...) and wrap in REST for a webservice
etc...
All these center around creating multiple workers that can each run on their own cpu. Certain implementations will use resources better (those that don't start a new process) and be easier to implement and maintain.
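For the fork route, a rough sketch of the throttling and the child reaper (again assuming a hypothetical MagicPlate module refactored out of magicplate.pl):
use strict;
use warnings;
use MagicPlate;    # hypothetical module refactored from magicplate.pl

my $max_workers = 8;
my %kids;

for my $file (grep { /\.png$/ } <*>) {
    # Throttle: block until a worker slot frees up.
    while (keys %kids >= $max_workers) {
        my $done = waitpid(-1, 0);
        delete $kids{$done} if $done > 0;
    }

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {    # child: do the work in-process, then exit
        MagicPlate::process($file, min => 4, max => 160, d => 1);    # hypothetical API
        exit 0;
    }
    $kids{$pid} = 1;    # parent: remember the child
}

# Reap whatever is still running.
while ((my $done = waitpid(-1, 0)) > 0) {
    delete $kids{$done};
}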
Related
I am working on a script which needs to spawn an Expect process periodically (every 5 minutes) to do some work. Below is the code I have that spawns an Expect process and does some work. The main process of the script is doing other work at all times (for example, it may be waiting for user input), so I am calling this 'spawn_expect' function from a thread that keeps calling it every 5 minutes, but the issue is that Expect is not working as expected.
If, however, I replace the thread with another process (that is, if I fork and let one process take care of spawning Expect while the other does the main work of the script, for example waiting at a prompt), then Expect works fine.
My question is: is it possible to have a thread spawn an Expect process? Do I have to resort to using a separate process to do this work? Thanks!
sub spawn_expect {
my $expect = Expect->spawn($release_config{kinit_exec});
my $position = $expect->expect(10,
[qr/Password.*: /, sub {my $fh = shift; print $fh "password\n";}],
[timeout => sub {print "Timed out";}]);
# if this function is run via a process, $position is defined, if it is run via a thread, it is not defined
...
}
Create the Expect object beforehand (not inside a thread) and pass it to a thread
my $exp = Expect->spawn( ... );
$exp->raw_pty(1);
$exp->log_stdout(0);
my ($thr) = threads->create(\&login, $exp);
my #res = $thr->join();
# ...
sub login {
my $exp = shift;
my $position = $exp->expect( ... );
# ...
}
I tested with multiple threads, where one uses Expect with a custom test script and returns the script's output to the main thread. Let me know if I should post these (short) programs.
When the Expect object is created inside a thread it fails for me, too. My guess is that in that case it can't set up its pty the way it does that normally.
Given the clarification in a comment, I'd use fork for the job, though.
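A rough sketch of that fork-based approach (the kinit path is a placeholder standing in for the question's $release_config{kinit_exec}):
use strict;
use warnings;
use Expect;

# Placeholder; in the question this comes from the real %release_config.
my %release_config = ( kinit_exec => '/usr/bin/kinit' );

sub spawn_expect_in_child {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    return $pid if $pid;    # parent: carry on with the script's main work

    # Child: create the Expect object here, so it gets its own pty.
    my $expect = Expect->spawn($release_config{kinit_exec});
    $expect->expect(10,
        [ qr/Password.*: /, sub { my $fh = shift; print $fh "password\n"; } ],
        [ timeout => sub { print "Timed out"; } ],
    );
    exit 0;
}

# The main loop can call this every 5 minutes and reap the child when convenient.
my $child = spawn_expect_in_child();
waitpid($child, 0);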
Here's my code:
use LWP;
use threads;
use strict;
use warnings;

my @thrds;

for (1..100)
{
    push @thrds, threads->create( 'doit' );
}

for (@thrds)
{
    $_->join();
}

sub doit
{
    LWP::UserAgent->new->get("http://dx.doi.org/10.1002/aoc.1067");
}
I'm using Windows 7 x64 and ActivePerl 5.20.2 x64, I also tried StrawberryPerl.
I've got a bunch of errors:
Thread ... terminated abnormally: Can't locate object method "_uric_escape" via package "URI" at .../URI.pm line 81
String found where operator expected at (eval 10) line 8, near "croak 'usage: $io->getlines()'"
(Do you need to predeclare croak?)
If I add
sleep 1;
before
push @thrds, threads->create( 'doit' );
it'll be ok.
What's the problem?
I'm not sure why, but there seem to be problems dealing with dynamically loaded modules. Explicitly loading them before thread creation solves the problem. In other words, add the following:
use Carp qw( );
use URI qw( );
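Applied to the code from the question, that just means pulling the lazily loaded modules in before the first threads->create call:
use strict;
use warnings;
use threads;
use LWP;
use Carp qw( );    # preload modules that LWP would otherwise load on demand,
use URI  qw( );    # so every thread starts with a fully initialized copy

my @thrds = map { threads->create('doit') } 1 .. 100;
$_->join() for @thrds;

sub doit {
    LWP::UserAgent->new->get("http://dx.doi.org/10.1002/aoc.1067");
}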
I think the problem here will be memory footprint. You are - after all - loading the LWP library, and then cloning your process 100 times.
Contrary to popular belief, threads in perl are not even remotely lightweight. They are not well suited to this model of usage - each thread is a 'full copy' of your process, and that's just ... not a great plan.
My copy of perl - ActivePerl 5.20.2 - doesn't exhibit the same problem (I don't think - I didn't actually want to spam that website you list).
I would suggest instead that you rewrite your threading to use Thread::Queue and a lower degree of parallelism:
use strict;
use warnings;
use LWP;
use threads;
use Thread::Queue;
my $workers = 10;
my $work_q = Thread::Queue->new();
my $url = "http://localhost:80";
sub worker_thread {
while ( my $url = $work_q->dequeue ) {
LWP::UserAgent->new->get($url);
}
}
threads->create( \&worker_thread ) for 1 .. $workers;
for ( 1 .. 100 ) {
$work_q->enqueue($url);
}
$work_q->end;
foreach my $thread ( threads->list ) {
$thread->join();
}
Alternatively, a fork-based approach may work better:
use strict;
use warnings;
use Parallel::ForkManager;
use LWP;
my $manager = Parallel::ForkManager->new(10);
for ( 1 .. 100 ) {
$manager->start and next;
LWP::UserAgent->new->get("http://localhost:80");
$manager->finish;
}
$manager->wait_all_children;
Edit:
A bit more testing on your sample - I do get similar runtime errors on RHEL running 5.20.2
They vary somewhat, which really does mean there has to be some kind of race condition going on here.
Which is odd, because threads are supposed to be standalone, yet here they clearly aren't.
In particular, my kernel kills the process because of memory exhaustion, thanks to a memory footprint of a few gigabytes, which is a pretty good reason not to use this approach.
My test case of running your code with RHEL and Perl 5.20.2 also triggers problems (occasionally).
What I get is similar sorts of errors, sporadically. I cannot see an obvious source, though. It might be as simple as too many open file descriptors or too much memory consumed. This is quite a hefty memory burden.
This is very straightforward using the Mojolicious framework. The Mojo::UserAgent class has been designed to work asynchronously with the help of the Mojo::IOLoop module, and although synchronous transactions are available, they are implemented as special cases of the standard asynchronous calls.
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::IOLoop;

my $ua = Mojo::UserAgent->new( max_redirects => 5 );

my $n;
for ( 1 .. 100 ) {
    $ua->get('http://dx.doi.org/10.1002/aoc.1067' => \&completed);
    ++$n;
}

Mojo::IOLoop->start;

sub completed {
    my ($ua, $tx) = @_;
    Mojo::IOLoop->stop unless --$n > 0;
}
A quick benchmark gave the following results
100 synchronous GET requests took 178 seconds
100 asynchronous GET requests took 10 seconds
I know this is an old topic, but just yesterday I ran into this problem. The information below may be useful to someone.
I was getting several kinds of errors from LWP in similar code (parallel http requests), including
"String found where operator expected at ... near "croak ..."
"Deep recursion on subroutine "IO::Socket::new" ... Out of memory!"
Sometimes requests just failed (returned status 500 or 501).
Problem seems to be gone after I added the following line:
use IO::File;
Don't ask me why, I have no idea :) Figured this out accidentally.
I am using Strawberry Perl 5.24.0.
I need to write a script that takes an array of values and, in a multithreaded way (forks?), runs another script with a value from the array as a parameter, but with a cap on the number of running forks, so that it waits for a script to finish if n of them are already running. How do I do that?
There is a Node.js module named child_process, but I'm not sure how to get this done with it, as it always waits for the child to terminate.
Basically, in PHP it would be something like this (I wrote it from my head, so it may contain some syntax errors):
<php
declare(ticks = 1);
$data = file('data.txt');
$max=20;
$child=0;
function sig_handler($signo) {
global $child;
switch ($signo) {
case SIGCHLD:
$child -= 1;
}
}
pcntl_signal(SIGCHLD, "sig_handler");
foreach($data as $dataline){
$dataline = trim($dataline);
while($child >= $max){
sleep(1);
}
$child++;
$pid=pcntl_fork();
if($pid){
// SOMETHING WENT WRONG? NEVER HAPPENS!
}else{
exec("php processdata.php \"$dataline\"");
exit;
}//fork
}
while($child != 0){
sleep(1);
}
?>
After the conversation in the comments, here's how to have Node execute your PHP script.
Since you're calling an external command, there's no need to create a new thread. The Node.js event loop treats calls to external commands as async operations, so it can execute all of them at the same time.
You can see different ways for executing an external process in this SO question (linked answer may be the best in your case).
However, since you're already moving everything to Node, you may even consider rewriting your "process.php" script to Node.js code. Since, as you explained, that script connects to remote servers and databases and uses nslookup (which you may not really need with Node.js), you won't need any separate thread: they're all async operations that Node.js excels at performing.
I'm kind of a newbie to threads in Perl.
I have a file with a list of projects (each project on a separate line), and I want to build those projects in parallel.
Currently, each thread:
opens the file in "read" mode
saves a list of some projects (= some of the file's lines)
closes the file
opens the file again, in "write" mode
rewrites it without the lines that were selected
In order to make sure each thread is the only one to access the file, I'm trying to use a semaphore.
For some reason, thread collisions occur, and I can't figure out what I am doing wrong.
I can see (in my "REPORT", which also records the current time for each build)
that different threads select the same projects from the "shared" file (it happens only once in a while, but still..)
I'm not even sure if my $semaphore declaration is legal as a "my" variable.
Any help would be truly appreciated!!
Thanks.
here's a part of my code:
my $semaphore = Thread::Semaphore->new();

sub build_from_targets_list {
    #...
    open(REPORT, "+>$REPORT_tmp");    # Open for output
    #....
    @threads = ();
    for ($i = 0; $i < $number_of_cores; $i++) {
        my $thr = threads->new(\&compile_process, $i, *REPORT);
        push @threads, $thr;
    }
    $_->join for @threads;
    close(REPORT);
}
### Some stuff..
sub compile_process {
    *REPORT = shift(@_);
    #...
    while (1) {
        $semaphore->down();
        open(DATA_FILE, $shared_file);
        flock(DATA_FILE, 2);
        while ($record = <DATA_FILE>) {
            chomp($record);
            push(@temp_target_list, $record);
        }
        # ... choose some lines (= projects)...
        # remove the projects that should be built by this thread:
        for ($k = 0; $k < $num_of_targets_in_project; $k++) {
            shift(@temp_target_list);
        }
        close(DATA_FILE);
        open(REWRITE, ">$shared_file");
        flock(REWRITE, 2);
        seek(REWRITE, 0, 0);
        foreach $temp_target (@temp_target_list) {
            print REWRITE "$temp_target\n";
        }
        close(REWRITE);
        ## ... BUILD selected projects...
        $semaphore->up();
    }
}
First, some basic cleanup of how you're dealing with files. No point in trying to debug a thread problem if it's a simple file issue.
One must check that any file commands (open, close, flock, seek, etc.) succeed. Either stick some "or die"s on there or use autodie.
Second is the use of a hard-coded constant for flock. Those are system dependent, and it's hard to remember which mode 2 is. Fcntl provides the constants.
You're opening the data file for reading with an exclusive lock (2 is usually the exclusive lock). That should probably be a shared lock. This is unlikely to cause your problem, but it will cause your threads to block unnecessarily.
Finally, use lexical filehandles instead of globally scoped globs. This reduces the chance of different threads (or different parts of the program) trampling on the same filehandle.
use Fcntl qw(:flock);
use autodie;
open my $data_fh, '<', $shared_file;
flock($data_fh, LOCK_SH);
As a side note, the seek $fh, 0, 0 after opening a file for writing is unnecessary. The same advice about constants applies to seek as to flock: use Fcntl to get them.
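For illustration, a rewind written with the symbolic constant (the file name is just a placeholder):
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

open my $fh, '+<', 'shared_projects.txt' or die $!;    # placeholder file name
seek $fh, 0, SEEK_SET;    # same as seek $fh, 0, 0, but self-documenting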
An additional bug is that you're passing in $i, *REPORT, but compile_process thinks *REPORT is the first argument. And again, the use of global filehandles means that passing the handle in is redundant; use lexical filehandles.
Now that that's out of the way: your basic algorithm seems flawed. compile_process has each thread read the whole data file into the thread-local array @temp_target_list, shift some entries off that local array, and write the rest out. Because @temp_target_list is per thread, there's no coordination. Unless $num_of_targets_in_project is shared and doing some sort of off-screen coordination, but that's not shown.
File based locking is always going to be a little slice of hell. Threads have much better mechanisms for coordination. There's a much easier way to do this.
Assuming the file isn't too large, read each line into a shared array. Then have each thread take items to work on from that array. The array is shared, so as each element is removed the array will update for all the threads. Something like...
use strict;
use warnings;
use autodie;
use threads;
use threads::shared;

my $Max_Threads = 5;
my @Todo : shared;

open my $fh, "<", $work_file;
@Todo = <$fh>;
close $fh;

my @threads;
for (1..$Max_Threads) {
    push @threads, threads->new(\&compile_process);
}
$_->join for @threads;

sub compile_process {
    while ( my $work = shift @Todo ) {
        ...do whatever with $work...
    }
}
If the file is too large to be held in memory, you can use Thread::Queue to build a queue of work items and add to it dynamically.
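For that case, a minimal Thread::Queue sketch along the same lines (the project file name and the build command are placeholders):
use strict;
use warnings;
use threads;
use Thread::Queue;

my $max_threads = 5;
my $queue       = Thread::Queue->new();

# Workers pull project names off the queue until it is ended.
my @workers = map {
    threads->create(sub {
        while (defined(my $project = $queue->dequeue)) {
            chomp $project;
            system("make", "-C", $project);    # placeholder for the real build step
        }
    });
} 1 .. $max_threads;

open my $fh, '<', 'projects.txt' or die "projects.txt: $!";    # placeholder file name
while (my $line = <$fh>) {
    $queue->enqueue($line);    # more work can be added dynamically later, too
}
close $fh;

$queue->end;              # tell the workers there is no more work
$_->join for @workers;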
I am trying to write a persistent/cached script. The code would look something like this:
...
memoize('process_file');
print process_file($ARGV[0]);
...
sub process_file {
    my $filename = shift;
    my ($a, $b, $c) = extract_values_from_file($filename);
    if (exists $my_hash{$a}{$b}{$c}) {
        return $my_hash{$a}{$b}{$c};
    }
    return $default;
}
Which would be called from a shell script in a loop as follows
value=`perl my_script.pl`;
Is there a way I could call this script such that it keeps its state from call to call? Let's assume that both initializing %my_hash and calling extract_values_from_file are expensive operations.
Thanks
This is kind of dark magic, but you can store state after your script's __DATA__ token and persist it.
use Data::Dumper; # or JSON, YAML, or any other data serializer
package MyPackage;
my $DATA_ptr;
our $state;
INIT {
$DATA_ptr = tell DATA;
$state = eval join "", <DATA>;
}
...
manipulate $MyPackage::state in this and other scripts
...
END {
open DATA, '+<', $0; # $0 is the name of this script
seek DATA, $DATA_ptr, 0;
print DATA Data::Dumper::Dumper($state);
truncate DATA, tell DATA; # in case new data is shorter than old data
close DATA;
}
__DATA__
$VAR1 = {
'foo' => 123,
'bar' => 42,
...
}
In the INIT block, store the position of the beginning of your file's __DATA__ section and deserialize your state. In the END block, you reserialize the current state and overwrite the __DATA__ section of your script. Of course, the user running the script needs to have write permission on the script.
Edited to use INIT block instead of BEGIN block -- the DATA block is not set up during the compile phase.
If %my_hash in your example has a moderate size in its final initialized state, you can simply use one of the serialization modules like Storable, JSON::XS, or Data::Dumper to keep your data in pre-assembled form between runs. Generate a new file when it is absent and just reload the ready-made content from there when it is present.
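A minimal sketch of that idea using Storable (the cache file name and the setup function are placeholders):
use strict;
use warnings;
use Storable qw(store retrieve);

my $cache_file = 'my_hash.cache';    # placeholder cache location
my %my_hash;

if (-e $cache_file) {
    %my_hash = %{ retrieve($cache_file) };    # cheap reload on subsequent runs
}
else {
    %my_hash = build_expensive_hash();        # placeholder for the costly setup
    store(\%my_hash, $cache_file);
}

sub build_expensive_hash {
    # stand-in for whatever expensive work actually populates %my_hash
    return ( foo => { bar => { baz => 42 } } );
}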
Also, you've mentioned that you would call this script in a loop. A good strategy would be not to call the script right away inside the loop, but to build a queue of arguments instead and then pass all of them to the script after the loop, in a single execution. The script would set up its environment once and then loop over the arguments, doing its easy work without needing to redo the setup steps for each of them.
You can't get the script to keep state. As soon as the process exits, any information not written to disk is gone.
There are a few ways you can accomplish this, though:
Write a daemon which listens on a network or Unix socket. The daemon can populate my_hash and answer questions sent from a very simple my_script.pl; it would only have to open a connection to the daemon, send the question, and return the answer. (See the sketch after this list.)
Create an efficient look-up file format. If you need the information often, it'll probably stay in the VFS cache anyway.
Set up a shared memory region. The first time your script starts, you save the information there, then re-use it later. That might be tricky from a Perl script, though.
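A bare-bones sketch of the daemon option over a Unix socket (the socket path, the one-key-per-line protocol, and the lookup data are all placeholders; a real daemon would also want error handling and daemonizing):
use strict;
use warnings;
use IO::Socket::UNIX;

my $socket_path = '/tmp/my_hash_daemon.sock';    # placeholder
unlink $socket_path;

# The expensive setup happens exactly once, when the daemon starts.
my %my_hash = ( 'a|b|c' => 'some value' );       # placeholder data

my $server = IO::Socket::UNIX->new(
    Type   => SOCK_STREAM(),
    Local  => $socket_path,
    Listen => 5,
) or die "Cannot create server socket: $!";

while (my $client = $server->accept) {
    chomp(my $key = <$client>);                  # protocol: one key per line
    print {$client} ($my_hash{$key} // 'default'), "\n";
    close $client;
}
The simple my_script.pl would then just connect to the socket, print its key, read one line back, and exit.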
No, not directly, but it can be achieved in many ways.
1) I understand extract_values_from_file() parses the given file and returns a hash.
2) Step 1 can be made into its own script that dumps the parsed hash into a file using Data::Dumper.
3) When running my_script.pl, ensure that the file generated in step 2 is newer than the config file. You can achieve this via make.
3.1) use the file generated in step 2 to retrieve values.
The same can be achieved via Storable's freeze/thaw.