Using semaphores while modifying a file in Perl - multithreading

I'm kind of a newbie to threads in Perl.
I have a file with a list of projects (each project is on a separate line), and I want to build those projects in parallel.
Currently, each thread:
opens the file in "read" mode
saves a list of some projects (= some file lines)
closes the file
opens the file again, in "write" mode
rewrites it without the lines that were selected
To make sure each thread is the only one to access the file, I'm trying to use a semaphore.
For some reason, thread collisions occur, and I can't figure out what I'm doing wrong.
I can see (in my "REPORT", which also records the current time for each build)
that different threads select the same projects from the "shared" file (it happens only once in a while, but still...).
I'm not even sure if my $semaphore declaration is legal as a "my" variable.
Any help would be truly appreciated!
Thanks.
Here's a part of my code:
my $semaphore = Thread::Semaphore->new();

sub build_from_targets_list{
    #...
    open(REPORT, "+>$REPORT_tmp"); # Open for output
    #....
    @threads = ();
    for ($i = 0; $i < $number_of_cores; $i++){
        my $thr = threads->new(\&compile_process, $i, *REPORT);
        push @threads, $thr;
    }
    $_->join for @threads;
    close (REPORT);
}
### Some stuff..
sub compile_process{
    *REPORT = shift(@_);
    #...
    while (1){
        $semaphore->down();
        open (DATA_FILE, $shared_file);
        flock(DATA_FILE, 2);
        while ($record = <DATA_FILE>) {
            chomp($record);
            push(@temp_target_list, $record);
        }
        # ... choose some lines (=projects)...
        # remove the projects that should be built by this thread:
        for ($k = 0; $k < $num_of_targets_in_project; $k++){
            shift(@temp_target_list);
        }
        close(DATA_FILE);
        open (REWRITE, ">$shared_file");
        flock(REWRITE, 2);
        seek(REWRITE, 0, 0);
        foreach $temp_target (@temp_target_list){
            print REWRITE "$temp_target\n";
        }
        close (REWRITE);
        ## ... BUILD selected projects...
        $semaphore->up();
    }
}

First, some basic cleanup of how you're dealing with files. No point in trying to debug a thread problem if it's a simple file issue.
You must check that file operations (open, close, flock, seek, etc.) succeed. Either stick some "or die"s on there or use autodie.
Second is the use of a hard-coded constant for flock. Those constants are system dependent, and it's hard to remember which mode 2 is. Fcntl provides the constants.
You're opening the data file for reading with an exclusive lock (2 is usually the exclusive lock). That should probably be a shared lock. It's unlikely to be the cause of your problem, but it will make your threads block unnecessarily.
Finally, use lexical filehandles instead of globally scoped globs. This reduces the chance of different parts of the program (or different threads) stepping on the same global filehandle.
use Fcntl qw(:flock);
use autodie;
open (my $data_fh, $shared_file);
flock($data_fh, LOCK_SH);
As a side note, the seek $fh, 0, 0 after opening a file for writing is unnecessary. The same advice about constants applies to seek as to flock: use Fcntl to get them.
An additional bug is that you're passing in $i, *REPORT, but compile_process treats *REPORT as the first argument. And again, the use of global filehandles means that passing the handle in is redundant; use lexical filehandles.
Now that that's out of the way, your basic algorithm seems flawed. compile_process has each thread read the whole data file into the thread-local array @temp_target_list, shift some entries off that local array, and write the rest out. Because @temp_target_list is per thread, there's no coordination. Unless $num_of_targets_in_project is shared and doing some sort of off-screen coordination, but that's not shown.
File based locking is always going to be a little slice of hell. Threads have much better mechanisms for coordination. There's a much easier way to do this.
Assuming the file isn't too large, read each line into a shared array. Then have each thread take items to work on from that array. The array is shared, so as each element is removed the array will update for all the threads. Something like...
use strict;
use warnings;
use autodie;
use threads;
use threads::shared;

my $Max_Threads = 5;
my @Todo : shared;

my $work_file = shift @ARGV;    # e.g. the shared list of projects, taken from the command line
open my $fh, "<", $work_file;
@Todo = <$fh>;
close $fh;

my @threads;
for (1..$Max_Threads) {
    push @threads, threads->new(\&compile_process);
}
$_->join for @threads;

sub compile_process {
    while( my $work = shift @Todo ) {
        # ... do whatever with $work ...
    }
}
If the file is too large to be held in memory, you can use Thread::Queue to build a queue of work items and add to it dynamically.
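For illustration, here's a minimal sketch of that queue-based variant; the input file name and the per-item work are placeholders, so adapt as needed:

use strict;
use warnings;
use threads;
use Thread::Queue;

my $Max_Threads = 5;
my $queue = Thread::Queue->new();

# Workers block on dequeue() until items arrive, and stop once the queue is ended and drained.
my @threads = map { threads->create(\&compile_process) } 1 .. $Max_Threads;

# Feed work items as they become available, e.g. while streaming a large file.
open my $fh, "<", "targets.txt" or die "targets.txt: $!";    # hypothetical file name
while (my $line = <$fh>) {
    chomp $line;
    $queue->enqueue($line);
}
close $fh;

$queue->end();                 # no more items; dequeue() returns undef once the queue is empty
$_->join for @threads;

sub compile_process {
    while (defined(my $work = $queue->dequeue())) {
        # ... build the project named in $work ...
    }
}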


Limit Execution Time in Node

I am working on a node-based MUD game and I would like to limit the amount of time any one command can execute before it gets killed (e.g. 1000ms). I found a module called Tripwire which seems promising but it does not appear to be actively maintained. Tripwire does work as advertised. It manages to force an exception if someone creates an endless loop, but it does not support any resumption of the original script thread.
I am looking for either:
(1) A similar but actively maintained Node module that can interrupt and resume the original event thread, or,
(2) A working example of V8's Isolate::IsExecutionTerminating + Isolate::CancelTerminateExecution (I forked Tripwire but I haven't done any meaningful C++ in a long time and am now just beating my head against the wall).
I have only been able to find test cases so far (which is at least something). I am really hoping that someone has already tackled this, though.
Test cases:
https://chromium.googlesource.com/v8/v8/+/ad55afcb459dafda1cf48e676985717fd7eae786/test/cctest/test-thread-termination.cc
I know this is a bit vague.
I ended up instrumenting the script by passing it through acorn and generating my own final script. I am hoping that the sandbox is locked down to prevent users from escaping it. Example of "compiled" output:
createPermissions(expr) {
let __mec = __bfc(this || GameMaster, 'public', 'createPermissions', __FILE__, false); try { let parts = expr.split('/');
for (let i = 0; i < parts.length; i++) {
__ala(); let foo = parts.slice(0, i).join('/');
} } finally { __efc(__mec, 'createPermissions'); }
}
This new "language" supports public, protected, package, and private variables/methods (by maintaining its own internal call stack, execution context, etc). The directives are "reserved words" (e.g. __bfc=begin function call, __ala=assert loop alarm).
Thanks @jfriend00 for the suggestion.
For those who are curious: Transpiler Module

Perl - Using LWP in multithread program

Here's my code:
use LWP;
use threads;
use strict;
use warnings;

my @thrds;

for (1..100)
{
    push @thrds, threads->create( 'doit' );
}

for (@thrds)
{
    $_->join();
}

sub doit
{
    LWP::UserAgent->new->get("http://dx.doi.org/10.1002/aoc.1067");
}
I'm using Windows 7 x64 and ActivePerl 5.20.2 x64, I also tried StrawberryPerl.
I've got a bunch of errors:
Thread ... terminated abnormally: Can't locate object method "_uric_escape" via package "URI" at .../URI.pm line 81
String found where operator expected at (eval 10) line 8, near "croak 'usage: $io->getlines()'"
(Do you need to predeclare croak?)
If I add
sleep 1;
before
push @thrds, threads->create( 'doit' );
it'll be ok.
What's the problem?
I'm not sure why, but there seem to be problems dealing with dynamically loaded modules. Explicitly loading them before thread creation solves the problem. In other words, add the following:
use Carp qw( );
use URI qw( );
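Applied to the script in the question, the fix would look something like this (the exact module list is a guess; preload whatever still shows up in the errors):

use strict;
use warnings;
use threads;
use LWP;
# Explicitly load modules that LWP/URI would otherwise pull in lazily inside the threads.
use Carp qw( );
use URI qw( );

my @thrds;
push @thrds, threads->create( 'doit' ) for 1 .. 100;
$_->join() for @thrds;

sub doit
{
    LWP::UserAgent->new->get("http://dx.doi.org/10.1002/aoc.1067");
}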
I think the problem here will be memory footprint. You are - after all - loading the LWP library, and then cloning your process 100 times.
Contrary to popular belief, threads in perl are not even remotely lightweight. They are not well suited to this model of usage - each thread is a 'full copy' of your process, and that's just ... not a great plan.
My copy of perl - ActivePerl 5.20.2 - doesn't exhibit the same problem (I don't think - I didn't actually want to spam that website you list).
I would suggest instead that you either rewrite your threading to use Thread::Queue and a lower degree of parallelism:
use strict;
use warnings;
use LWP;
use threads;
use Thread::Queue;

my $workers = 10;
my $work_q  = Thread::Queue->new();

my $url = "http://localhost:80";

sub worker_thread {
    while ( my $url = $work_q->dequeue ) {
        LWP::UserAgent->new->get($url);
    }
}

threads->create( \&worker_thread ) for 1 .. $workers;

for ( 1 .. 100 ) {
    $work_q->enqueue($url);
}
$work_q->end;

foreach my $thread ( threads->list ) {
    $thread->join();
}
Otherwise, a fork-based approach may also work better:
use strict;
use warnings;
use Parallel::ForkManager;
use LWP;

my $manager = Parallel::ForkManager->new(10);

for ( 1 .. 100 ) {
    $manager->start and next;
    LWP::UserAgent->new->get("http://localhost:80");
    $manager->finish;
}
$manager->wait_all_children;
Edit:
A bit more testing on your sample - I do get similar runtime errors on RHEL running 5.20.2
They vary somewhat, which really does mean there has to be some kind of race condition going on here.
Which is odd, because threads are supposed to be standalone, and apparently they're not.
In particular - my kernel bombs with a 'kill' because of memory exhaustion thanks to a few gigs of memory footprint, which is a pretty good reason not to use this approach.
My test case of running your code with RHEL and Perl 5.20.2 also triggers problems, though only occasionally.
What I get is similar sorts of errors, sporadically. I cannot see an obvious source though. It might be as simple as too many open file descriptors or too much memory consumed; this is quite a hefty memory burden.
This is very straightforward using the Mojolicious framework. The Mojo::UserAgent class has been designed to work asynchronously with the help of the Mojo::IOLoop module, and although synchronous transactions are available, they are implemented as special cases of the standard asynchronous calls.
use strict;
use warnings;
use Mojo;

my $ua = Mojo::UserAgent->new( max_redirects => 5 );
my $n;

for ( 1 .. 100 ) {
    $ua->get('http://dx.doi.org/10.1002/aoc.1067' => \&completed);
    ++$n;
}

Mojo::IOLoop->start;

sub completed {
    my ($ua, $tx) = @_;
    Mojo::IOLoop->stop unless --$n > 0;
}
A quick benchmark gave the following results
100 synchronous GET requests took 178 seconds
100 asynchronous GET requests took 10 seconds
I know this is an old topic, but just yesterday I ran into this problem. The info below may be useful to someone.
I was getting several kinds of errors from LWP in similar code (parallel http requests), including
"String found where operator expected at ... near "croak ..."
"Deep recursion on subroutine "IO::Socket::new" ... Out of memory!"
Sometimes requests just failed (returned status 500 or 501).
Problem seems to be gone after I added the following line:
use IO::File;
Don't ask me why, I have no idea :) Figured this out accidentally.
I am using Strawberry Perl 5.24.0.

node.js multithreading with max child count

I need to write a script that takes an array of values and, in a multithreaded way (forks?), runs another script with a value from the array as a parameter, but with a maximum number of running forks, so it would wait for a script to finish if there are already n of them running. How do I do that?
There is a module named child_process, but I'm not sure how to get this done with it, as it always waits for the child to terminate.
Basically, in PHP it would be something like this (written from memory, so it may contain some syntax errors):
<?php
declare(ticks = 1);
$data = file('data.txt');
$max = 20;
$child = 0;

function sig_handler($signo) {
    global $child;
    switch ($signo) {
        case SIGCHLD:
            $child -= 1;
    }
}
pcntl_signal(SIGCHLD, "sig_handler");

foreach ($data as $dataline) {
    $dataline = trim($dataline);
    while ($child >= $max) {
        sleep(1);
    }
    $child++;
    $pid = pcntl_fork();
    if ($pid) {
        // SOMETHING WENT WRONG? NEVER HAPPENS!
    } else {
        exec("php processdata.php \"$dataline\"");
        exit;
    } // fork
}

while ($child != 0) {
    sleep(1);
}
?>
After the conversation in the comments, here's how to have Node execute your PHP script.
Since you're calling an external command, there's no need to create a new thread. The Node.js event loop understands that calls to external commands are async operations, and it can execute all of them at the same time.
You can see different ways for executing an external process in this SO question (linked answer may be the best in your case).
However, since you're already moving everything to Node, you may even consider rewriting your "process.php" script to Node.js code. Since, as you explained, that script connects to remote servers and databases and uses nslookup (which you may not really need with Node.js), you won't need any separate thread: they're all async operations that Node.js excels at performing.

Writing a persistent perl script

I am trying to write a persistent/cached script. The code would look something like this:
...
memoize('process_file');
print process_file($ARGV[0]);
...
sub process_file {
    my $filename = shift;
    my ($a, $b, $c) = extract_values_from_file($filename);
    if (exists $my_hash{$a}{$b}{$c}) {
        return $my_hash{$a}{$b}{$c};
    }
    return $default;
}
Which would be called from a shell script in a loop as follows
value=`perl my_script.pl`;
Is there a way I could call this script in such a way that it keeps its state from call to call? Let's assume that both initializing %my_hash and calling extract_values_from_file are expensive operations.
Thanks
This is kind of dark magic, but you can store state after your script's __DATA__ token and persist it.
use Data::Dumper;    # or JSON, YAML, or any other data serializer
package MyPackage;
my $DATA_ptr;
our $state;

INIT {
    $DATA_ptr = tell DATA;
    $state = eval join "", <DATA>;
}

# ... manipulate $MyPackage::state in this and other scripts ...

END {
    open DATA, '+<', $0;               # $0 is the name of this script
    seek DATA, $DATA_ptr, 0;
    print DATA Data::Dumper::Dumper($state);
    truncate DATA, tell DATA;          # in case new data is shorter than old data
    close DATA;
}

__DATA__
$VAR1 = {
    'foo' => 123,
    'bar' => 42,
    ...
}
In the INIT block, store the position of the beginning of your file's __DATA__ section and deserialize your state. In the END block, you reserialize the current state and overwrite the __DATA__ section of your script. Of course, the user running the script needs to have write permission on the script.
Edited to use an INIT block instead of a BEGIN block -- the DATA filehandle is not set up during the compile phase.
If %my_hash in your example has a moderate size in its final initialized state, you can simply use one of the serialization modules like Storable, JSON::XS or Data::Dumper to keep your data in pre-assembled form between runs. Generate a new file when it is absent, and just reload the ready content from there when it is present.
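A minimal sketch of that idea with Storable; the cache file name and the rebuild step are placeholders:

use strict;
use warnings;
use Storable qw(store retrieve);

my $cache_file = 'my_hash.cache';          # hypothetical cache location
my %my_hash;

if (-e $cache_file) {
    # Reload the pre-assembled hash from a previous run.
    %my_hash = %{ retrieve($cache_file) };
}
else {
    # The expensive initialization happens only when the cache is missing.
    %my_hash = build_my_hash();            # placeholder for your real setup code
    store(\%my_hash, $cache_file);
}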
Also, you've mentioned that you would call this script in loops. A good strategy would be to not call the script right away inside the loop, but to build a queue of arguments instead and then pass all of them to the script after the loop, in a single execution. The script would set up its environment once and then loop over the arguments, doing its easy work without having to redo the setup steps for each of them.
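A rough sketch of that batching idea; the setup routine and the argument handling are placeholders:

# Shell side: collect the arguments first, then make one call, e.g.
#   perl my_script.pl file1.txt file2.txt file3.txt
use strict;
use warnings;

my %my_hash = init_my_hash();    # hypothetical expensive setup, done once per run

for my $filename (@ARGV) {
    print process_file($filename), "\n";
}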
You can't get the script to keep state. As soon as the process exits, any information not written to disk is gone.
There are a few ways you can accomplish this, though:
Write a daemon which listens on a network or Unix socket (see the sketch after this list). The daemon can populate my_hash and answer questions sent from a very simple my_script.pl; it would only have to open a connection to the daemon, send the question and return the answer.
Create an efficient look-up file format. If you need the information often, it'll probably stay in the VFS cache anyway.
Set up a shared memory region. The first time your script starts you save the information there, then re-use it later. That might be tricky from a Perl script, though.
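For the daemon option, a minimal sketch over a Unix socket might look something like this; the socket path, the line-based protocol and build_my_hash() are all assumptions:

# daemon.pl - started once; does the expensive setup, then answers lookups
use strict;
use warnings;
use IO::Socket::UNIX;

my $path = '/tmp/my_hash.sock';            # hypothetical socket path
unlink $path;

my %my_hash = build_my_hash();             # placeholder for the expensive initialization
my $default = '';                          # fallback value, as in the question

my $server = IO::Socket::UNIX->new(
    Type   => SOCK_STREAM(),
    Local  => $path,
    Listen => 5,
) or die "Can't listen on $path: $!";

while (my $client = $server->accept) {
    chomp(my $filename = <$client>);                          # the "question" is a file name
    my ($a, $b, $c) = extract_values_from_file($filename);    # your sub, from the question
    print {$client} ($my_hash{$a}{$b}{$c} // $default), "\n";
    close $client;
}

# my_script.pl - the cheap client called from the shell loop
use strict;
use warnings;
use IO::Socket::UNIX;

my $sock = IO::Socket::UNIX->new(
    Type => SOCK_STREAM(),
    Peer => '/tmp/my_hash.sock',
) or die "Is the daemon running? $!";
print {$sock} "$ARGV[0]\n";
print scalar <$sock>;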
No, not directly, but it can be achieved in many ways.
1) I understand **extract_values_from_file()** parses the given file and returns a hash.
2) Step 1 can be made into its own script, which dumps the parsed hash into a file using **Data::Dumper**.
3) When running my_script.pl, ensure that the file generated in step 2 is newer than the config file. You can achieve this via **make**.
3.1) **use** the file generated in step 2 to retrieve the values.
The same can be achieved via freeze/thaw (e.g. from Storable).

How can I modify my perl script to use multiple processors?

Hi, I have a simple script that takes a file and runs another Perl script on it. The script does this to every picture file in the current folder. This is running on a machine with 2 quad-core Xeon processors, 16GB of RAM, running Red Hat Linux.
The first script, work.pl, basically calls magicplate.pl, passing some parameters and the name of the file for magicplate.pl to process. Magic Plate takes about a minute to process each image. Because work.pl performs the same function over 100 times, and because the system has multiple processors and cores, I was thinking about splitting the task up so that it could run multiple times in parallel. I could split the images up into different folders if necessary. Any help would be great. Thank you.
Here is what I have so far:
use strict;
use warnings;

my @initialImages = <*>;

foreach my $file (@initialImages) {
    if ($file =~ /.png/) {
        print "processing $file...\n";
        my @tmp = split(/\./, $file);
        my $name = "";
        for (my $i = 0; $i < (@tmp - 1); $i++) {
            if ($name eq "") { $name = $tmp[$i]; } else { $name = $name . "." . $tmp[$i]; }
        }
        my $exten = $tmp[(@tmp - 1)];
        my $orig = $name . "." . $exten;
        system("perl magicPlate.pl -i " . $orig . " -min 4 -max 160 -d 1");
    }
}
You should consider NOT creating a new process for each file that you want to process - it's horribly inefficient, and probably what is taking most of your time here. Just loading up Perl and whatever modules you use, over and over, creates a lot of overhead. I recall a poster on PerlMonks who did something similar and ended up transforming his second script into a module, reducing the work time from an hour to a couple of minutes. Not that you should expect such a dramatic improvement, but one can dream...
With the second script refactored as a module, here's an example of thread usage, in which BrowserUK creates a thread pool, feeding it jobs through a queue.
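A rough sketch of that pool-plus-queue pattern, assuming magicPlate.pl has been refactored into a hypothetical MagicPlate module with a process() routine:

use strict;
use warnings;
use threads;
use Thread::Queue;
use MagicPlate;                       # hypothetical module refactored from magicPlate.pl

my $n_workers = 8;                    # e.g. one per core on a dual quad-core box
my $queue     = Thread::Queue->new();
$queue->enqueue(glob '*.png');        # queue up every PNG in the current directory
$queue->end();                        # nothing more will be added

my @workers = map {
    threads->create(sub {
        while (defined(my $file = $queue->dequeue())) {
            MagicPlate::process($file);    # assumed interface
        }
    });
} 1 .. $n_workers;

$_->join for @workers;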
You could use Parallel::ForkManager (set $MAX_PROCESSES to the number of files processed at the same time):
use strict;
use warnings;
use Parallel::ForkManager;

my $MAX_PROCESSES = 8;    # number of files processed at the same time, e.g. one per core
my $pm = Parallel::ForkManager->new($MAX_PROCESSES);

my @initialImages = <*>;

foreach my $file (@initialImages) {
    if ($file =~ /.png/) {
        print "processing $file...\n";
        my @tmp = split(/\./, $file);
        my $name = "";
        for (my $i = 0; $i < (@tmp - 1); $i++) {
            if ($name eq "") { $name = $tmp[$i]; } else { $name = $name . "." . $tmp[$i]; }
        }
        my $exten = $tmp[(@tmp - 1)];
        my $orig = $name . "." . $exten;

        my $pid = $pm->start and next;    # fork; the parent moves on to the next file
        system("perl magicPlate.pl -i " . $orig . " -min 4 -max 160 -d 1");
        $pm->finish;                      # terminates the child process
    }
}
$pm->wait_all_children;
But as suggested by Hugmeir, running the Perl interpreter again and again for each new file is not a good idea.
Import "maigcplate" and use threading.
Start magicplate.pl in the background (you would need to add process throttling)
Import "magicplate" and use fork (add process throttling and a kiddy reaper)
Make "maigcplate" a daemon with a pool of workers = # of CPUs
use an MQ implementation for communication
use sockets for communication
Use webserver(nginx, apache, ...) and wrap in REST for a webservice
etc...
All these center around creating multiple workers that can each run on their own cpu. Certain implementations will use resources better (those that don't start a new process) and be easier to implement and maintain.
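And the fork-based variant of that idea, again assuming a hypothetical MagicPlate module refactored from magicPlate.pl, so the interpreter and module are loaded only once instead of once per file:

use strict;
use warnings;
use Parallel::ForkManager;
use MagicPlate;                            # hypothetical module refactored from magicPlate.pl

my $pm = Parallel::ForkManager->new(8);    # process throttling: at most 8 children at a time

for my $file (glob '*.png') {
    $pm->start and next;                   # parent: fork a child and move on to the next file
    MagicPlate::process($file);            # assumed interface; no per-file perl startup cost
    $pm->finish;                           # child exits; ForkManager reaps it
}
$pm->wait_all_children;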
