How to speed up Parallel::ForkManager in perl - multithreading

I am using EC2 amazon server to perform data processing of 63 files,
the server i am using has 16 core but using perl Parallel::ForkManager with number of thread = number of core then it seems like half the core are sleeping and the working core are not at 100% and fluctuate around 25%~50%
I also checked IO and it is mostly iddling.
use Sys::Info;
use Sys::Info::Constants qw( :device_cpu );
my $info = Sys::Info->new;
my $cpu = $info->device( CPU => %options );
use Parallel::ForkManager;
my $manager=new Parallel::ForkManager($cpu->count);
for($i=0;$i<=$#files_l;$i++)
{
$manager->start and next;
do_stuff($files_l[$i]);
$manager->finish;
}
$manager->wait_all_children;

The short answer is - we can't tell you, because it depends entirely on what 'do_stuff' is doing.
The major reasons why parallel code doesn't create linear speed increases are:
Process creation overhead - some 'work' is done to spawn a process, so if the children are trivially small, that 'wastes' effort.
Contented resources - the most common is disk IO, but things like file locks, database handles, sockets, or interprocess communication can also play a part.
something else causing a 'back off' that stalls a process.
And without knowing what 'do_stuff' does, we can't second guess what it might be.
However I'll suggest a couple of steps:
Double the number of processes to twice CPU count. That's often a 'sweet spot' because it means that any non-CPU delay in a process just means one of the others get to go full speed.
Try strace -fTt <yourprogram> (if you're on linux, the commands are slightly different on other Unix variants). Then do it again with strace -fTtc because the c will summarise syscall run times. Look at which ones take the most 'time'.
Profile your code to see where the hot spots are. Devel::NYTProf is one library you can use for this.
And on a couple of minor points:
my $manager=new Parallel::ForkManager($cpu->count);
Would be better off written:
my $manager=Parallel::ForkManager -> new ( $cpu->count);
Rather than using indirect object notation.
If you are just iterating #files then it might be better to not use a loop count variable and instead:
foreach my $file ( #files ) {
$manager -> start and next;
do_stuff($file);
$manager -> finish;
}

Related

How to control multi-threads synchronization in Perl

I got array with [a-z,A-Z] ASCII numbers like so: my #alphabet = (65..90,97..122);
So main thread functionality is checking each character from alphabet and return string if condition is true.
Simple example :
my #output = ();
for my $ascii(#alphabet){
thread->new(\sub{ return chr($ascii); });
}
I want to run thread on every ASCII number, then put letter from thread function into array in the correct order.
So in out case array #output should be dynamic and contain [a..z,A-Z] after all threads finish their job.
How to check, is all threads is done and keep the order?
You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take a long time more than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Store all the threads for each letter in an array.
my #threads = map { thread->new(\sub{ return chr($_); }) } #alphabet;
my #results = map { $_->join } #threads;
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return code, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, #results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
the overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
Because the thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.
Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
do => [ sub {
select(undef, undef, undef, (200+int(rand(8))*100)/1000);
return chr($_[0]);
} ],
);
my #alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } #alphabet;
print "\n";
The results are returned in order, as soon as they become available.
I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii(#alphabet){
$wu->async(sub{ return chr($ascii); });
}
#output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii(#alphabet){
$wu->queue(sub{ return chr($ascii); });
}
#output = $wu->waitall();

perl fork() doesn't seem to utilize the cores, but cpu only

I've got a perl program that trying to do conversion of a bunch of files from one format to another (via a command-line tool). It works fine, but too slow as it's converting the files one and a time.
I researched and utilize the fork() mechanism trying to spawn off all the conversion as child-forks hoping to utilize the cpu/cores.
Coding is done and tested, it does improve performance, but not to the way I expected.
When looking at /proc/cpuinfo, I have this:
> egrep -e "core id" -e ^physical /proc/cpuinfo|xargs -l2 echo|sort -u
physical id : 0 core id : 0
physical id : 0 core id : 1
physical id : 0 core id : 2
physical id : 0 core id : 3
physical id : 1 core id : 0
physical id : 1 core id : 1
physical id : 1 core id : 2
physical id : 1 core id : 3
That means I have 2 CPU and quad-core each? If so, I should able to fork out 8 forks and supposingly I should able to make a 8-min job (1min per file, 8 files) to finish in 1-min (8 forks, 1 file per fork).
However, when I test run this, it still take 4-min to finish. It appears like it only utilized 2 CPUs, but not the cores?
Hence, my question is:
Is it true that perl's fork() only parallel it based on CPUs, but not cores? Or maybe I didn't do it right? I'm simply using fork() and wait(). Nothing special.
I'd assume perl's fork() should be using cores, is there a simple bash/perl that I can write to prove my OS (i.e. RedHat 4) nor Perl is the culprit for such symptom?
To Add:
I even tried running the following command multiple times to simulate multiple processing and monitor htop.
while true; do echo abc >>devnull; done &
Somehow htop is telling me I've got 16 cores? and then when I spawn 4 of the above while-loop, I see 4 of them utilizing ~100% cpu each. When I spawn more, all of them start reducing the cpu utilization percentage evenly. (e.g. 8 processing, see 8 bash in htop, but using ~50% each) Does this mean something?
Thanks ahead. I tried google around but not able to find an obvious answer.
Edit: 2016-11-09
Here is the extract of perl code. I'm interested to see what I did wrong here.
my $maxForks = 50;
my $forks = 0;
while(<CIFLIST>) {
extractPDFByCIF($cifNumFromIndex, $acctTypeFromIndex, $startDate, $endDate);
}
for (1 .. $forks) {
my $pid = wait();
print "Child fork exited. PID=$pid\n";
}
sub extractPDFByCIF {
# doing SQL constructing to for the $stmt to do a DB query
$stmt->execute();
while ($stmt->fetch()) {
# fork the copy/afp2web process into child process
if ($forks >= $maxForks) {
my $pid = wait();
print "PARENTFORK: Child fork exited. PID=$pid\n";
$forks--;
}
my $pid = fork;
if (not defined $pid) {
warn "PARENTFORK: Could not fork. Do it sequentially with parent thread\n";
}
if ($pid) {
$forks++;
print "PARENTFORK: Spawned child fork number $forks. PID=$pid\n";
}else {
print "CHILDFORK: Processing child fork. PID=$$\n";
# prevent child fork to destroy dbh from parent thread
$dbh->{InactiveDestroy} = 1;
undef $dbh;
# perform the conversion as usual
if($fileName =~ m/.afp/){
system("file-conversion -parameter-list");
} elsif($fileName =~ m/.pdf/) {
system("cp $from-file $to-file");
} else {
print ERRORLOG "Problem happened here\r\n";
}
exit;
}
# end forking
$stmt->finish();
close(INDEX);
}
fork() spawns a new process - identical to, and with the same state as the existing one. No more, no less. The kernel schedules it and runs it wherever.
If you do not get the results you're expecting, I would suggest that a far more likely limiting factor is that you are reading files from your disk subsystem - disks are slow, and contending for IO isn't actually making them any faster - if anything the opposite, because it forces additional drive seeks and less easy caching.
So specifically:
1/ No, fork() does nothing more than clone your process.
2/ Largely meaningless unless you want to rewrite most of your algorithm as a shell script. There's no real reason to think that it'll be any different though.
To follow on from your edit:
system('file-conversion') looks an awful lot like an IO based process, which will be limited by your disk IO. As will your cp.
Have you considered Parallel::ForkManager which greatly simplifies the forking bit?
As a lesser style point, you should probably use 3 arg 'open'.
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;
my $maxForks = 50;
my $manager = Parallel::ForkManager->new($maxForks);
while ($ciflist) {
## do something with $_ to parse.
##instead of: extractPDFByCIF($cifNumFromIndex, $acctTypeFromIndex, $startDate, $endDate);
# doing SQL constructing to for the $stmt to do a DB query
$stmt->execute();
while ( $stmt->fetch() ) {
# fork the copy/afp2web process into child process
$manager->start and next;
print "CHILDFORK: Processing child fork. PID=$$\n";
# prevent child fork to destroy dbh from parent thread
$dbh->{InactiveDestroy} = 1;
undef $dbh;
# perform the conversion as usual
if ( $fileName =~ m/.afp/ ) {
system("file-conversion -parameter-list");
} elsif ( $fileName =~ m/.pdf/ ) {
system("cp $from-file $to-file");
} else {
print ERRORLOG "Problem happened here\r\n";
}
# end forking
$manager->finish;
}
$stmt->finish();
}
$manager->wait_all_children;
Your goal is to parallelize your application in a way that is using multiple cores as independent resources. What you want to achieve is multi-threading, in particular Perl's ithreads that are using calls to the fork() function of the underlying system (and are heavy-weight for that reason). You can teach the Perl way of multi-threading to yourself from perlthrtut. Quote from perlthrtut:
When a new Perl thread is created, all the data associated with the current thread is copied to the new thread, and is subsequently private to that new thread! This is similar in feel to what happens when a Unix process forks, except that in this case, the data is just copied to a different part of memory within the same process rather than a real fork taking place.
That being said, regarding your questions:
You're not doing it right (sorry). [see my comment...] With multi-threading you don't need to call fork() by yourself for that, but Perl will do it for you.
You can check whether your Perl interpreter has been built with thread support e.g. by perl -V (note the capital V) and looking at the messages. If there is nothing to see about threads then your Perl interpreter is not capable of Perl-multithreading.
The reason that your application is already faster even with only one CPU core by using fork() is likely that while one process has to wait for slow resources such as the file system another process can use the same core as a computation resource in the meantime.

How To Prevent Perl Thread Printouts from Intercepting?

Suppose I made three perl threads that each do this:
print hi I am thread $threadnum!;
print this is a test!;
you would expect an output of:
hi I am thread 1
this is a test!
hi I am thread 2
this is a test!
hi I am thread 3
this is a test!
but instead, what happens about half of the time is:
hi I am thread 1
this is a test!
hi I am thread2
hi I am thread3
this is a test!
this is a test!
is there any was to ensure that they output in the correct order WITHOUT condensing it all into one line?
First of all: don't use perl interpreter threads.
That being said, to prevent those lines from being printed separately, your options are either:
Acquire a semaphore while printing these two lines to prevent two threads from entering that critical section at the same time.
Print a single line of text with an embedded newline, e.g:
print "hi I am thread $threadnum\nthis is a test!\n";
Fundamentally here, you're misunderstanding what threads do. They're designed to operate in parallel and asynchronously.
That means different threads hit different bits of the program at different times, and potentially run on different processors.
One of the drawbacks of this is- as you have found - you cannot guarantee ordering or atomicity of operations. That's true of compound operations too - you can't actually guarantee that even a print statement is an atomic operation - you can end up with split lines.
You should always assume that any operation isn't atomic, unless you know for sure otherwise, and lock accordingly. Most of the time you will get away with it, but you can find yourself tripping over some truly horrific and hard to find bugs, because in a small percentage of cases, your non-atomic operations are interfering with each other. Even things like ++ may not be. (This isn't a problem on variables local to your thread though, just any time you're interacting with a shared resource, like a file, STDOUT, shared variable, etc.)
This is a very common problem with parallel programming though, and so there are a number of solutions:
Use lock and a shared variable:
##outside threads:
use threads::shared;
my $lock : shared;
And inside the threads:
{
lock $lock;
### do atomic operation
}
When the lock leaves scope, it'll be released. A thread will 'block' waiting to obtain that lock, so for that one bit, you are no longer running parallel.
Use a Thread::Semaphore
Much like a lock - you have the Thread::Semaphore module, that ... in your case works the same. But it's built around limited (but more than 1) resources. I wouldn't use this in your scenario, but it can be useful if you're trying to e.g. limit concurrent disk IO and concurrent processor usage - set a semaphore:
use Thread::Semaphore;
my $limit = Thread::Semaphore -> new ( 8 );
And inside the threads:
$limit -> down();
#do protected bit
$limit -> up();
You can of course, set that 8 to 1, but you don't gain that much over lock. (Just the ability to up() to remove it, rather than letting it go out of scope).
Use a 'IO handler' thread with Thread::Queue
(In forking, you might use a pipe here).
use Thread::Queue;
my $output = Thread::Queue -> new ();
sub print_output_thread {
while ( $output -> dequeue ) {
print;
}
}
threads -> create ( \&output_thread );
And in your threads, rather than print you would use:
$output -> enqueue ( "Print this message \n" );
This thread serialises your output, and will ensure each message is atomic - but note that if you do two enqueue operations, they might be interleaved again, for exactly the same reason.
So you would need to;
$output -> enqueue ( "Print this message\n", "And this message too\n" );
(you can also lock your queue as in the first example). That's again, perhaps a bit overkill for your example, but it can be useful if you're trying to collate results into a particular ordering.

Run Perl Script on Several files Simultaneously

I have written a Perl script that reads in a data file, line by line, does some computations and returns 3 files as outputs; I have also written it so that it reads through every *.csv file that I have in my directory, one file at the time, returning the 3 separate output files for each input file (so for 10 csv input files, when my script is done, I have 30 output files.)
However, when I run my script, I see that it only runs on one core. What I would like to do is make my script run simultaneously on several input files: is this even possible? Or, alternatively, what would be a better option? I'm working on a Windows machine.
There are two (main) options for using more processors in Perl.
Threads and forks. There's a certain amount of similarity between them, but some crucial differences. fork() is a native system call on Unix, and it's extremely efficient (it's used a lot). On Windows you don't have it - perl emulates it's functionality quite well though.
fork clones your program exactly - it makes a 'parent' and 'child' and the only difference is the return code from fork. Codes resumes from exactly the same point, so you can end up with some slightly strange behaviour.
Note for example when you run:
#!/usr/bin/perl
use strict;
use warnings;
my $pid = fork();
if ( $pid ) {
print "$$ is the parent - child is $pid\n";
}
else {
print "$$ is the child\n";
}
You should be aware - every variable that exists before is still defined in each 'fork' but it's a separate copy. This leads you to the next challenge, which is inter-process communications. This is a sufficiently big topic that it's got it's own perl documentation page perlipc
When it comes to more paralleism though, fork can somewhat awkward because it's a low level call. How many lines do you think this prints?
#!/usr/bin/perl
use strict;
use warnings;
my #fruits = qw ( apple pear lemon lime cucumber );
foreach my $fruit ( #fruits ) {
my $pid = fork();
if ( $pid ) {
print "Parent $$ with a child of $pid has a fruit of $fruit\n";
}
else {
print "Child of $$ has a fruit of $fruit\n";
}
}
Because the fork is nested, it happens more times than you might intuitively guess. It's also somewhat easy to fork too much with a loop, and you can create a denial of service condition.
Fortunately, there is a solution - Parallel::Forkmanager implements some simple mechanisms for controlling forks, which makes it a lot smoother.
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
my #fruits = qw ( apple pear lemon lime cucumber );
my $manager = Parallel::ForkManager -> new ( 5 );
print "Parent: $$\n";
foreach my $fruit ( #fruits ) {
$manager -> start and next;
print "Child of $$ - $fruit\n";
$manager -> finish;
}
$manager -> wait_all_children;
For the sake of completeness - I'll also mention threads. They're another way of doing things, but they're slightly counter-intuitively not lightweight as they are in other languages. They're also a status of 'discouraged':
The "interpreter-based threads" provided by Perl are not the fast, lightweight system for multitasking that one might expect or hope for. Threads are implemented in a way that make them easy to misuse. Few people know how to use them correctly or will be able to provide help.
The use of interpreter-based threads in perl is officially discouraged.
So where forks are fairly easy to have lots and lots of, threads are basically best thought of as separate processes.
#!/usr/bin/perl
use strict;
use warnings;
use threads;
sub thread_sub {
print threads -> self -> tid(). ": #_\n";
}
my #fruits = qw ( apple pear lemon lime cucumber );
foreach my $fruit ( #fruits ) {
threads -> create ( \&thread_sub, $fruit );
}
foreach my $thr ( threads -> list ) {
$thr -> join;
}
In either case though, you should be aware - parallel processing means your code no longer happens in an obvious sequential fashion. This means you'll have some really fruity and funky bugs that will be horrific to debug if you're not careful. So make sure you code works sequentially first before even trying to go near parallelism.
You should also be aware - you only get linear performance improvements for as long as your limiting factor is purely CPU. It generally isn't. Disk IO is invariably much slower. You mention processing several files. If the emphasis is on processing, rather than reading data - then parallelism will help.
But disks are pretty slow, and 'thrashing' them by trying to stream data from multiple locations will slow them down more. So you don't gain much - if anything - by parallising IO intensive tasks (disk traversals, bulk file reads etc.), and you can quite easily make matters worse.

multithreading or forking in perl

in my perl script I'm collecting a large data and later I need it to post to server, up to this I'm good but my criteria is that post to server takes subsequently large time so I need to a threading / forking concept so that one will post and parallely I can dig my second data set at same time while posting to server is taking place.
code snippet
if(system("curl -sS $post_url --data-binary \#$filename -H 'Content-type:text/xml;charset=utf-8' 1>/dev/null") != 0)
{
exit_script(" xml: Error ","Unable to update $filename xml on $post_url");
}
can any one tell me is this achievable with threading or forking.
It's difficult to give an answer to your question, because it depends.
Yes, Perl supports both forking and threading.
In general, I would suggest looking at threading for data-oriented tasks, and forking for almost anything else.
And so what you want to so is eminently achievable.
First you need to:
Encapsulate your tasks into subroutines. Get that working first. (This is very important - parallel stuff causes worlds of pain and is difficult to troubleshoot if you're not careful - get it working single threaded first).
Run your subroutines as threads, and capture their results.
Something like this:
use threads;
sub curl_update {
my $result = system ( "you_curl_command" );
return $result;
}
#start the async curl
my $thr = threads -> create ( \&curl_update );
#do your other stuff....
sleep ( 60 );
my $result = $thr -> join();
if ( $result ) {
#do whatever you would if the curl update failed
}
In this, the join is a blocking call - your main code will stop and wait for your thread to complete. If you want to do something more complicated, you can use is_running or is_joinable which are non blocking.
I'd suggest neither.
You're just talking lots of HTTP. You can talk concurrent HTTP a lot nicer, because it's just network IO, by using any of the asynchronous IO systems. Perl has many of them.
Principally I'd suggest IO::Async, but then I wrote it. You can use Net::Async::HTTP to make an HTTP hit. This will fully support doing many of them at once - many hundreds or thousands if need be.
Otherwise, you can also try either POE or AnyEvent, which will both support the same thing in their own way.

Resources