How To Prevent Perl Thread Printouts from Intercepting? - multithreading

Suppose I made three perl threads that each do this:
print hi I am thread $threadnum!;
print this is a test!;
you would expect an output of:
hi I am thread 1
this is a test!
hi I am thread 2
this is a test!
hi I am thread 3
this is a test!
but instead, what happens about half of the time is:
hi I am thread 1
this is a test!
hi I am thread2
hi I am thread3
this is a test!
this is a test!
is there any was to ensure that they output in the correct order WITHOUT condensing it all into one line?

First of all: don't use perl interpreter threads.
That being said, to prevent those lines from being printed separately, your options are either:
Acquire a semaphore while printing these two lines to prevent two threads from entering that critical section at the same time.
Print a single line of text with an embedded newline, e.g:
print "hi I am thread $threadnum\nthis is a test!\n";

Fundamentally here, you're misunderstanding what threads do. They're designed to operate in parallel and asynchronously.
That means different threads hit different bits of the program at different times, and potentially run on different processors.
One of the drawbacks of this is- as you have found - you cannot guarantee ordering or atomicity of operations. That's true of compound operations too - you can't actually guarantee that even a print statement is an atomic operation - you can end up with split lines.
You should always assume that any operation isn't atomic, unless you know for sure otherwise, and lock accordingly. Most of the time you will get away with it, but you can find yourself tripping over some truly horrific and hard to find bugs, because in a small percentage of cases, your non-atomic operations are interfering with each other. Even things like ++ may not be. (This isn't a problem on variables local to your thread though, just any time you're interacting with a shared resource, like a file, STDOUT, shared variable, etc.)
This is a very common problem with parallel programming though, and so there are a number of solutions:
Use lock and a shared variable:
##outside threads:
use threads::shared;
my $lock : shared;
And inside the threads:
{
lock $lock;
### do atomic operation
}
When the lock leaves scope, it'll be released. A thread will 'block' waiting to obtain that lock, so for that one bit, you are no longer running parallel.
Use a Thread::Semaphore
Much like a lock - you have the Thread::Semaphore module, that ... in your case works the same. But it's built around limited (but more than 1) resources. I wouldn't use this in your scenario, but it can be useful if you're trying to e.g. limit concurrent disk IO and concurrent processor usage - set a semaphore:
use Thread::Semaphore;
my $limit = Thread::Semaphore -> new ( 8 );
And inside the threads:
$limit -> down();
#do protected bit
$limit -> up();
You can of course, set that 8 to 1, but you don't gain that much over lock. (Just the ability to up() to remove it, rather than letting it go out of scope).
Use a 'IO handler' thread with Thread::Queue
(In forking, you might use a pipe here).
use Thread::Queue;
my $output = Thread::Queue -> new ();
sub print_output_thread {
while ( $output -> dequeue ) {
print;
}
}
threads -> create ( \&output_thread );
And in your threads, rather than print you would use:
$output -> enqueue ( "Print this message \n" );
This thread serialises your output, and will ensure each message is atomic - but note that if you do two enqueue operations, they might be interleaved again, for exactly the same reason.
So you would need to;
$output -> enqueue ( "Print this message\n", "And this message too\n" );
(you can also lock your queue as in the first example). That's again, perhaps a bit overkill for your example, but it can be useful if you're trying to collate results into a particular ordering.

Related

How to control multi-threads synchronization in Perl

I got array with [a-z,A-Z] ASCII numbers like so: my #alphabet = (65..90,97..122);
So main thread functionality is checking each character from alphabet and return string if condition is true.
Simple example :
my #output = ();
for my $ascii(#alphabet){
thread->new(\sub{ return chr($ascii); });
}
I want to run thread on every ASCII number, then put letter from thread function into array in the correct order.
So in out case array #output should be dynamic and contain [a..z,A-Z] after all threads finish their job.
How to check, is all threads is done and keep the order?
You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take a long time more than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Store all the threads for each letter in an array.
my #threads = map { thread->new(\sub{ return chr($_); }) } #alphabet;
my #results = map { $_->join } #threads;
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return code, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, #results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
the overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
Because the thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.
Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
do => [ sub {
select(undef, undef, undef, (200+int(rand(8))*100)/1000);
return chr($_[0]);
} ],
);
my #alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } #alphabet;
print "\n";
The results are returned in order, as soon as they become available.
I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii(#alphabet){
$wu->async(sub{ return chr($ascii); });
}
#output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii(#alphabet){
$wu->queue(sub{ return chr($ascii); });
}
#output = $wu->waitall();

How to speed up Parallel::ForkManager in perl

I am using EC2 amazon server to perform data processing of 63 files,
the server i am using has 16 core but using perl Parallel::ForkManager with number of thread = number of core then it seems like half the core are sleeping and the working core are not at 100% and fluctuate around 25%~50%
I also checked IO and it is mostly iddling.
use Sys::Info;
use Sys::Info::Constants qw( :device_cpu );
my $info = Sys::Info->new;
my $cpu = $info->device( CPU => %options );
use Parallel::ForkManager;
my $manager=new Parallel::ForkManager($cpu->count);
for($i=0;$i<=$#files_l;$i++)
{
$manager->start and next;
do_stuff($files_l[$i]);
$manager->finish;
}
$manager->wait_all_children;
The short answer is - we can't tell you, because it depends entirely on what 'do_stuff' is doing.
The major reasons why parallel code doesn't create linear speed increases are:
Process creation overhead - some 'work' is done to spawn a process, so if the children are trivially small, that 'wastes' effort.
Contented resources - the most common is disk IO, but things like file locks, database handles, sockets, or interprocess communication can also play a part.
something else causing a 'back off' that stalls a process.
And without knowing what 'do_stuff' does, we can't second guess what it might be.
However I'll suggest a couple of steps:
Double the number of processes to twice CPU count. That's often a 'sweet spot' because it means that any non-CPU delay in a process just means one of the others get to go full speed.
Try strace -fTt <yourprogram> (if you're on linux, the commands are slightly different on other Unix variants). Then do it again with strace -fTtc because the c will summarise syscall run times. Look at which ones take the most 'time'.
Profile your code to see where the hot spots are. Devel::NYTProf is one library you can use for this.
And on a couple of minor points:
my $manager=new Parallel::ForkManager($cpu->count);
Would be better off written:
my $manager=Parallel::ForkManager -> new ( $cpu->count);
Rather than using indirect object notation.
If you are just iterating #files then it might be better to not use a loop count variable and instead:
foreach my $file ( #files ) {
$manager -> start and next;
do_stuff($file);
$manager -> finish;
}

Does join in perl threads block SIGALRM?

I have a small sample program that hangs on perl 5.16.3. I am attempting to use an alarm to trigger if two threads don't finish working in time in a much more complicated program, but this boils down the gist of it. I know there's plenty of other ways to do this, but for the sake of argument, let's say I'm stuck with the code the way it is. I'm not sure if this is a bug in perl, or something that legitimately shouldn't work.
I have researched this on the Internet, and it seems like mixing alarms and threads is generally discouraged, but I've seen plenty of examples where people claim that this is a perfectly reasonable thing to do, such as this other SO question, Perl threads with alarm. The code provided in the accepted answer on that question also hangs on my system, which is why I'm wondering if maybe this is something that's now broke, at least as of 5.16.3.
It appears that in the code below, if I call join before the alarm goes off, the alarm never triggers. If I replace the join with while(1){} and go into a busy-wait loop, then the alarm goes off just fine, so it appears that join is blocking the SIGALRM for some reason.
My expectation is that the join happens, and then a few seconds later I see "Alarm!" printed on the screen, but this never happens, so long as that join gets called before the alarm goes off.
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
sub worker {
print "Worker thread started.\n";
while(1){}
}
my $thread = threads->create(\&worker);
print "Setting alarm.\n";
$SIG{ALRM} = sub { print "Alarm!\n" };
alarm 2;
print "Joining.\n";
$thread->join();
The problem has nothing to do with threads. Signals are only processed between Perl ops, and join is written in C, so the signal will only be handled when join returns. The following demonstrates this:
#!/usr/bin/env perl
use strict;
use warnings;
use threads;
sub worker {
print "Worker thread started.\n";
for (1..5) {
sleep(1);
print(".\n");
}
}
my $thread = threads->create(\&worker);
print "Setting alarm.\n";
$SIG{ALRM} = sub { print "Alarm!\n" };
alarm 2;
print "Joining.\n";
$thread->join();
Output:
Setting alarm.
Joining.
Worker thread started.
.
.
.
.
.
Alarm!
join is essentially a call to pthread_join. Unlike other blocking system calls, pthread_join does not get interrupted by signals.
By the way, I renamed $tid to $thread since threads->create returns a thread object, not a thread id.
I'm going to post an answer to my own question to add some detail to ikegami's response above, and summarize our conversation, which should save future visitors from having to read through the huge comment trail it collected.
After discussing things with ikegami, I went and did some more reading on perl signals, consulted some other perl experts, and discovered the exact reason why join isn't being "interrupted" by the interpreter. As ikegami said, signals only get delivered in between perl operations. In perl, this is called Deferred Signals, or Safe Signals.
Deferred Signals were released in 5.8.0, back in 2002, which could be one of the reasons I was seeing older posts on the Net which don't appear to work. They probably worked with "unsafe signals", which act more like signal delivery that we're used to in C. In fact, as of 5.8.1, you can turn off deferred signal delivery by setting the environment variable PERL_SIGNALS=unsafe before executing your script. When I do this, the threads::join call is indeed interrupted as I was expecting, just as pthread_join is interrupted in C in this same scenario.
Unlike other I/O operations, like read, which returns EINTR when a signal interrupts it, threads::join doesn't do this. Under the hood it's a call to the C library call pthread_join, which the man page confirms does not return EINTR. Under deferred signals, when the interpreter gets the SIGALRM, it schedules delivery of the signal, deferring it, until the threads::join->pthread_join library call returns. Since pthread_join doesn't "interrupt" and return EINTR, my SIGALRM is effectively being swallowed by the threads::join. With other I/O operations, they would "interrupt" and return EINTR, giving the perl interpreter a chance to deliver the signal and then restart the system call via SA_RESTART.
Obviously, running in unsafe signals mode is probably a Bad Thing, so as an alternative, according to perlipc, you can use the POSIX module to install a signal handler directly via sigaction. This then makes the one particular signal "unsafe".

Goroutines are cooperatively scheduled. Does that mean that goroutines that don't yield execution will cause goroutines to run one by one?

From: http://blog.nindalf.com/how-goroutines-work/
As the goroutines are scheduled cooperatively, a goroutine that loops continuously can starve other goroutines on the same thread.
Goroutines are cheap and do not cause the thread on which they are multiplexed to block if they are blocked on
network input
sleeping
channel operations or
blocking on primitives in the sync package.
So given the above, say that you have some code like this that does nothing but loop a random number of times and print the sum:
func sum(x int) {
sum := 0
for i := 0; i < x; i++ {
sum += i
}
fmt.Println(sum)
}
if you use goroutines like
go sum(100)
go sum(200)
go sum(300)
go sum(400)
will the goroutines run one by one if you only have one thread?
A compilation and tidying of all of creker's comments.
Preemptive means that kernel (runtime) allows threads to run for a specific amount of time and then yields execution to other threads without them doing or knowing anything. In OS kernels that's usually implemented using hardware interrupts. Process can't block entire OS. In cooperative multitasking thread have to explicitly yield execution to others. If it doesn't it could block whole process or even whole machine. That's how Go does it. It has some very specific points where goroutine can yield execution. But if goroutine just executes for {} then it will lock entire process.
However, the quote doesn't mention recent changes in the runtime. fmt.Println(sum) could cause other goroutines to be scheduled as newer runtimes will call scheduler on function calls.
If you don't have any function calls, just some math, then yes, goroutine will lock the thread until it exits or hits something that could yield execution to others. That's why for {} doesn't work in Go. Even worse, it will still lead to process hanging even if GOMAXPROCS > 1 because of how GC works, but in any case you shouldn't depend on that. It's good to understand that stuff but don't count on it. There is even a proposal to insert scheduler calls in loops like yours
The main thing that Go's runtime does is it gives its best to allow everyone to execute and don't starve anyone. How it does that is not specified in the language specification and might change in the future. If the proposal about loops will be implemented then even without function calls switching could occur. At the moment the only thing you should remember is that in some circumstances function calls could cause goroutine to yield execution.
To explain the switching in Akavall's answer, when fmt.Printf is called, the first thing it does is checks whether it needs to grow the stack and calls the scheduler. It MIGHT switch to another goroutine. Whether it will switch depends on the state of other goroutines and exact implementation of the scheduler. Like any scheduler, it probably checks whether there're starving goroutines that should be executed instead. With many iterations function call has greater chance to make a switch because others are starving longer. With few iterations goroutine finishes before starvation happens.
For what its worth it. I can produce a simple example where it is clear that the goroutines are not ran one by one:
package main
import (
"fmt"
"runtime"
)
func sum_up(name string, count_to int, print_every int, done chan bool) {
my_sum := 0
for i := 0; i < count_to; i++ {
if i % print_every == 0 {
fmt.Printf("%s working on: %d\n", name, i)
}
my_sum += 1
}
fmt.Printf("%s: %d\n", name, my_sum)
done <- true
}
func main() {
runtime.GOMAXPROCS(1)
done := make(chan bool)
const COUNT_TO = 10000000
const PRINT_EVERY = 1000000
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
<- done
<- done
}
Result:
....
Amy working on: 7000000
Brian working on: 8000000
Amy working on: 8000000
Amy working on: 9000000
Brian working on: 9000000
Brian: 10000000
Amy: 10000000
Also if I add a function that just does a forever loop, that will block the entire process.
func dumb() {
for {
}
}
This blocks at some random point:
go dumb()
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
Well, let's say runtime.GOMAXPROCS is 1. The goroutines run concurrently one at a time. Go's scheduler just gives the upper hand to one of the spawned goroutines for a certain time, then to another, etc until all are finished.
So, you never know which goroutine is running at a given time, that's why you need to synchronize your variables. From your example, it's unlikely that sum(100) will run fully, then sum(200) will run fully, etc
The most probable is that one goroutine will do some iterations, then another will do some, then another again etc.
So, the overall is that they are not sequential, even if there is only one goroutine active at a time (GOMAXPROCS=1).
So, what's the advantage of using goroutines ? Plenty. It means that you can just do an operation in a goroutine because it is not crucial and continue the main program. Imagine an HTTP webserver. Treating each request in a goroutine is convenient because you do not have to care about queueing them and run them sequentially: you let Go's scheduler do the job.
Plus, sometimes goroutines are inactive, because you called time.Sleep, or they are waiting for an event, like receiving something for a channel. Go can see this and just executes other goroutines while some are in those idle states.
I know there are a handful of advantages I didn't present, but I don't know concurrency that much to tell you about them.
EDIT:
Related to your example code, if you add each iteration at the end of a channel, run that on one processor and print the content of the channel, you'll see that there is no context switching between goroutines: Each one runs sequentially after another one is done.
However, it is not a general rule and is not specified in the language. So, you should not rely on these results for drawing general conclusions.
#Akavall Try adding sleep after creating dumb goroutine, goruntime never executes sum_up goroutines.
From that it looks like go runtime spawns next go routines immediately, it might execute sum_up goroutine until go runtime schedules dumb() goroutine to run. Once dumb() is scheduled to run then go runtime won't schedule sum_up goroutines to run, as dumb runs for{}

multithreading: how to process data in a vector, while the vector is being populated?

I have a single-threaded linux app which I would like to make parallel. It reads a data file, creates objects, and places them in a vector. Then it calls a compute-intensive method (.5 second+) on each object. I want to call the method in parallel with object creation. While I've looked at qt and tbb, I am open to other options.
I planned to start the thread(s) while the vector was empty. Each one would call makeSolids (below), which has a while loop that would run until interpDone==true and all objects in the vector have been processed. However, I'm a n00b when it comes to threading, and I've been looking for a ready-made solution.
QtConcurrent::map(Iter begin,Iter end,function()) looks very easy, but I can't use it on a vector that's changing in size, can I? And how would I tell it to wait for more data?
I also looked at intel's tbb, but it looked like my main thread would halt if I used parallel_for or parallel_while. That stinks, since their memory manager was recommended (open cascade's mmgt has poor performance when multithreaded).
/**intended to be called by a thread
\param start the first item to get from the vector
\param skip how many to skip over (4 for 4 threads)
*/
void g2m::makeSolids(uint start, uint incr) {
uint curr = start;
while ((!interpDone) || (lineVector.size() > curr)) {
if (lineVector.size() > curr) {
if (lineVector[curr]->isMotion()) {
((canonMotion*)lineVector[curr])->setSolidMode(SWEPT);
((canonMotion*)lineVector[curr])->computeSolid();
}
lineVector[curr]->setDispMode(BEST);
lineVector[curr]->display();
curr += incr;
} else {
uio::sleep(); //wait a little bit for interp
}
}
}
EDIT: To summarize, what's the simplest way to process a vector at the same time that the main thread is populating the vector?
Firstly, to benefit from threading you need to find similarly slow tasks for each thread to do. You said your per-object processing takes .5s+, how long does your file reading / object creation take? It could easily be a tenth or a thousandth of that time, in which case your multithreading approach is going to produce neglegible benefit. If that's the case, (yes, I'll answer your original question soon incase it's not) then think about simultaneously processing multiple objects. Given your processing takes quite a while, the thread creation overhead isn't terribly significant, so you could simply have your main file reading/object creation thread spawn a new thread and direct it at the newly created object. The main thread then continues reading/creating subsequent objects. Once all objects are read/created, and all the processing threads launched, the main thread "joins" (waits for) the worker threads. If this will create too many threads (thousands), then put a limit on how far ahead the main thread is allowed to get: it might read/create 10 objects then join 5, then read/create 10, join 10, read/create 10, join 10 etc. until finished.
Now, if you really want the read/create to be in parallel with the processing, but the processing to be serialised, then you can still use the above approach but join after each object. That's kind of weird if you're designing this with only this approach in mind, but good because you can easily experiment with the object processing parallelism above as well.
Alternatively, you can use a more complex approach that just involves the main thread (that the OS creates when your program starts), and a single worker thread that the main thread must start. They should be coordinated using a mutex (a variable ensuring mutually-exclusive, which means not-concurrent, access to data), and a condition variable which allows the worker thread to efficiently block until the main thread has provided more work. The terms - mutex and condition variable - are the standard terms in the POSIX threading that Linux uses, so should be used in the explanation of the particular libraries you're interested in. Summarily, the worker thread waits until the main read/create thread broadcasts it a wake-up signal indicating another object is ready for processing. You may want to have a counter with index of the last fully created, ready-for-processing object, so the worker thread can maintain it's count of processed objects and move along the ready ones before once again checking the condition variable.
It's hard to tell if you have been thinking about this problem deeply and there is more than you are letting on, or if you are just over thinking it, or if you are just wary of threading.
Reading the file and creating the objects is fast; the one method is slow. The dependency is each consecutive ctor depends on the outcome of the previous ctor - a little odd - but otherwise there are no data integrity issues so there doesn't seem to be anything that needs to be protected by mutexes and such.
Why is this more complicated than something like this (in crude pseudo-code):
while (! eof)
{
readfile;
object O(data);
push_back(O);
pthread_create(...., O, makeSolid);
}
while(x < vector.size())
{
pthread_join();
x++;
}
If you don't want to loop on the joins in your main then spawn off a thread to wait on them by passing a vector of TIDs.
If the number of created objects/threads is insane, use a thread pool. Or put a counter is the creation loop to limit the number of threads that can be created before running ones are joined.
#Caleb: quite -- perhaps I should have emphasized active threads. The GUI thread should always be considered one.

Resources