How should I spawn threads for parallel computation? - multithreading

Today, I got into multi-threading. Since it's a new concept, I thought I could begin to learn by translating a simple iteration to a parallelized one. But, I think I got stuck before I even began.
Initially, my loop looked something like this:
let stuff: Vec<u8> = items.into_iter().map(|item| {
some_item_worker(&item)
}).collect();
I had put a reasonably large amount of stuff into items and it took about 0.05 seconds to finish the computation. So, I was really excited to see the time reduction once I successfully implemented multi-threading!
When I used threads, I got into trouble, probably due to my bad reasoning.
use std::thread;
let threads: Vec<_> = items.into_iter().map(|item| {
thread::spawn(move || {
some_item_worker(&item)
})
}).collect(); // yeah, this is followed by another iter() that unwraps the values
I have a quad-core CPU, which means that I can run only up to 4 threads concurrently. I guessed that it worked this way: once the iterator starts, threads are spawned. Whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently.
The result was that it took (after some re-runs) ~0.2 seconds to finish the same computation. Clearly, there's no parallel computing going on here. I don't know why the time increased by 4 times, but I'm sure that I've misunderstood something.
Since this isn't the right way, how should I go about modifying the program so that the threads execute concurrently?
EDIT:
I'm sorry, I was wrong about that ~0.2 seconds. I woke up and tried it again, when I noticed that the usual iteration ran for 2 seconds. It turned out that some process had been consuming the memory wildly. When I rebooted my system and tried the threaded iteration again, it ran for about 0.07 seconds. Here are some timings for each run.
Actual iteration (first one):
0.0553760528564 seconds
0.0539519786835 seconds
0.0564560890198 seconds
Threaded one:
0.0734670162201 seconds
0.0727820396423 seconds
0.0719120502472 seconds
I agree that the threads are indeed running concurrently, but the threaded version takes about 20 ms longer to finish the job. My actual goal was to use my processor to run the threads in parallel and finish the job sooner. Is this gonna be complicated? What should I do to make those threads run in parallel, not just concurrently?

I have a quad-core CPU, which means that I can run only up to 4 threads concurrently.
Only 4 may be running concurrently, but you can certainly create more than 4...
whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently (it was just a guess).
Whenever you have a guess, you should create an experiment to figure out if your guess is correct. Here's one:
use std::{thread, time::Duration};
fn main() {
let threads: Vec<_> = (0..500)
.map(|i| {
thread::spawn(move || {
println!("Thread #{i} started!");
thread::sleep(Duration::from_millis(500));
println!("Thread #{i} finished!");
})
})
.collect();
for handle in threads {
handle.join().unwrap();
}
}
If you run this, you will see that "Thread #N started!" is printed out 500 times, followed by 500 lines of "Thread #N finished!"
Clearly, there's no parallel computing going on here
Unfortunately, your question isn't fleshed out enough for us to tell why your time went up. In the example I've provided, it takes a little less than 600 ms, so it's clearly not happening in serial!

Creating a thread has a cost. If the cost of the computation inside the thread is small enough, it'll be dwarfed by the cost of the threads or the inefficiencies caused by the threads.
For example, spawning 10 million threads to double 10 million u8s will probably not be worth it. Vectorizing it would probably yield better performance.
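To see that cost for yourself, here is a rough sketch that doubles a small vector of u8s once on the calling thread and once with one thread per item. The helper names (`double_serial`, `double_thread_per_item`, `timed`) are made up for illustration, and the timings will vary by machine, so no figures are claimed here.

```rust
use std::thread;
use std::time::{Duration, Instant};

/// Doubles each byte on the calling thread (wrapping on overflow).
fn double_serial(items: &[u8]) -> Vec<u8> {
    items.iter().map(|x| x.wrapping_mul(2)).collect()
}

/// Doubles each byte on its own thread. Deliberately wasteful: the work per
/// thread is far smaller than the cost of creating and joining the thread.
fn double_thread_per_item(items: &[u8]) -> Vec<u8> {
    let handles: Vec<_> = items
        .iter()
        .map(|&x| thread::spawn(move || x.wrapping_mul(2)))
        .collect();
    // Joining in spawn order keeps the results in input order.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

/// Runs `f` once and returns its result with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() {
    // 1000 items is an arbitrary size, small enough to spawn a thread for each.
    let items: Vec<u8> = (0..1000u32).map(|i| (i % 256) as u8).collect();
    let (serial, t_serial) = timed(|| double_serial(&items));
    let (threaded, t_threaded) = timed(|| double_thread_per_item(&items));
    assert_eq!(serial, threaded);
    println!("serial: {t_serial:?}, one thread per item: {t_threaded:?}");
}
```

Both versions produce the same vector; only the elapsed times differ.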
That said, you might still get some improvement from parallelizing cheap tasks, but you want to use fewer threads. Either use a thread pool with a small number of threads (so that only a few threads exist at any given point and there is less CPU contention), or use something that is more sophisticated under the hood but has a quite simple API, like Rayon.
use rayon::prelude::*; // brings `par_iter` into scope

// Notice `.par_iter()` turns it into a parallel iterator;
// each `item` is already a reference here
let stuff: Vec<u8> = items.par_iter().map(|item| {
    some_item_worker(item)
}).collect();
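If adding a dependency is not an option, the "small number of threads" idea can also be sketched with the standard library alone: split the input into one chunk per worker and process each chunk on a scoped thread. This is only a sketch under assumptions. `parallel_map` is a hypothetical helper, and the body and input type of `some_item_worker` are invented stand-ins, since the question does not show them.

```rust
use std::num::NonZeroUsize;
use std::thread;

// Hypothetical stand-in for the question's `some_item_worker`.
fn some_item_worker(item: &u32) -> u8 {
    (item % 251) as u8
}

/// Maps `some_item_worker` over `items` using at most `workers` threads,
/// preserving input order.
fn parallel_map(items: &[u32], workers: usize) -> Vec<u8> {
    let workers = workers.max(1);
    // Ceiling division so every item lands in some chunk; at least 1 because
    // `chunks` panics on a chunk size of zero.
    let chunk_len = ((items.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // One scoped thread per chunk; scoped threads may borrow `items`.
        let handles: Vec<_> = items
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(some_item_worker).collect::<Vec<u8>>())
            })
            .collect();
        // Joining in spawn order concatenates the chunks in input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect::<Vec<u8>>()
    })
}

fn main() {
    // Fall back to 4 workers (an arbitrary choice) if the count is unknown.
    let workers = thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(4);
    let items: Vec<u32> = (0..10_000).collect();
    let stuff = parallel_map(&items, workers);
    println!("{} results on up to {workers} threads", stuff.len());
}
```

Equal-sized chunks are the simplest split. If items vary a lot in cost, a work-stealing scheduler such as Rayon's will usually balance the load better.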

Related

How to control multi-threads synchronization in Perl

I have an array with the ASCII codes for [A-Z, a-z], like so: my @alphabet = (65..90,97..122);
The main thread's job is to check each character of the alphabet and return a string if a condition is true.
Simple example :
use threads;

my @output = ();
for my $ascii (@alphabet) {
    # Pass the code ref itself, not a reference to it.
    threads->new(sub { return chr($ascii); });
}
I want to run a thread for every ASCII number, then put the letter returned by each thread function into an array in the correct order.
So in our case the array @output should be built dynamically and contain [A-Z, a-z] after all threads finish their job.
How can I check that all threads are done and keep the order?
You're looking for $thread->join, which waits for a thread to finish. It's documented here, and this SO question may also help.
Since in your case it looks like the work being done in the threads is roughly equal in cost (no thread is going to take a long time more than any other), you can just join each thread in order, like so, to wait for them all to finish:
# Store all the threads for each letter in an array.
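# The code is passed to each thread as an argument rather than through $_.
my @threads = map { threads->create(sub { return chr($_[0]); }, $_) } @alphabet;
# Join in creation order, so the results come back in order too.
my @results = map { $_->join } @threads;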
Since, when the first thread returns from join, the others are likely already done and just waiting for "join" to grab their return code, or about to be done, this gets you pretty close to "as fast as possible" parallelism-wise, and, since the threads were created in order, @results is ordered already for free.
Now, if your threads can take variable amounts of time to finish, or if you need to do some time-consuming processing in the "main"/spawning thread before plugging child threads' results into the output data structure, joining them in order might not be so good. In that case, you'll need to somehow either: a) detect thread "exit" events as they happen, or b) poll to see which threads have exited.
You can detect thread "exit" events using signals/notifications sent from the child threads to the main/spawning thread. The easiest/most common way to do that is to use the cond_wait and cond_signal functions from threads::shared. Your main thread would wait for signals from child threads, process their output, and store it into the result array. If you take this approach, you should preallocate your result array to the right size, and provide the output index to your threads (e.g. use a C-style for loop when you create your threads and have them return ($result, $index_to_store) or similar) so you can store results in the right place even if they are out of order.
You can poll which threads are done using the is_joinable thread instance method, or using the threads->list(threads::joinable) and threads->list(threads::running) methods in a loop (hopefully not a busy-waiting one; adding a sleep call--even a subsecond one from Time::HiRes--will save a lot of performance/battery in this case) to detect when things are done and grab their results.
Important Caveat: spawning a huge number of threads to perform a lot of work in parallel, especially if that work is small/quick to complete, can cause performance problems, and it might be better to use a smaller number of threads that each do more than one "piece" of work (e.g. spawn a small number of threads, and each thread uses the threads::shared functions to lock and pop the first item off of a shared array of "work to do" and do it rather than map work to threads as 1:1). There are two main performance problems that arise from a 1:1 mapping:
The overhead (in memory and time) of spawning and joining each thread is much higher than you'd think (benchmark it on threads that don't do anything, just return, to see). If the work you need to do is fast, the overhead of thread management for tons of threads can make it much slower than just managing a few re-usable threads.
If you end up with a lot more threads than there are logical CPU cores and each thread is doing CPU-intensive work, or if each thread is accessing the same resource (e.g. reading from the same disks or the same rows in a database), you hit a performance cliff pretty quickly. Tuning the number of threads to the "resources" underneath (whether those are CPUs or hard drives or whatnot) tends to yield much better throughput than trusting the thread scheduler to switch between many more threads than there are available resources to run them on. The reasons this is slow are, very broadly:
The thread scheduler (part of the OS, not the language) can't know enough about what each thread is trying to do, so preemptive scheduling cannot optimize for performance past a certain point, given that limited knowledge.
The OS usually tries to give most threads a reasonably fair shot, so it can't reliably say "let one run to completion and then run the next one" unless you explicitly bake that into the code (since the alternative would be unpredictably starving certain threads for opportunities to run). Basically, switching between "run a slice of thread 1 on resource X" and "run a slice of thread 2 on resource X" doesn't get you anything once you have more threads than resources, and adds some overhead as well.
TL;DR threads don't give you performance increases past a certain point, and after that point they can make performance worse. When you can, reuse a number of threads corresponding to available resources; don't create/destroy individual threads corresponding to tasks that need to be done.
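The reuse pattern described above (a few long-lived workers popping jobs off a shared list, with each job carrying the index where its result belongs) is not specific to Perl. Here is a minimal sketch of it in Rust, the language of the first question in this thread, using only the standard library. `chars_in_order` and the worker count of 4 are assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Converts ASCII codes to characters on `workers` reusable threads and
/// returns the results in input order.
fn chars_in_order(codes: Vec<u8>, workers: usize) -> Vec<char> {
    let total = codes.len();
    // Each job carries its output index so results may arrive in any order.
    let queue: VecDeque<(usize, u8)> = codes.into_iter().enumerate().collect();
    let queue = Arc::new(Mutex::new(queue));
    let (tx, rx) = mpsc::channel();

    let handles: Vec<_> = (0..workers.max(1))
        .map(|_| {
            let queue = Arc::clone(&queue);
            let tx = tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to take one job.
                let job = queue.lock().unwrap().pop_front();
                match job {
                    Some((index, code)) => tx.send((index, char::from(code))).unwrap(),
                    None => break, // queue drained: this worker is done
                }
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends once all workers exit.
    drop(tx);

    // Preallocate the output and store each result at its own index.
    let mut output = vec!['\0'; total];
    for (index, ch) in rx {
        output[index] = ch;
    }
    for handle in handles {
        handle.join().unwrap();
    }
    output
}

fn main() {
    let alphabet: Vec<u8> = (65..=90).chain(97..=122).collect();
    let output: String = chars_in_order(alphabet, 4).into_iter().collect();
    println!("{output}");
}
```

Only four threads are ever created, no matter how many codes are queued, and the index attached to each job keeps the output ordered even when workers finish out of order.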
Building on Zac B's answer, you can use the following if you want to reuse threads:
use strict;
use warnings;
use Thread::Pool::Simple qw( );
$| = 1;
my $pool = Thread::Pool::Simple->new(
do => [ sub {
select(undef, undef, undef, (200+int(rand(8))*100)/1000);
return chr($_[0]);
} ],
);
my @alphabet = ( 65..90, 97..122 );
print $pool->remove($_) for map { $pool->add($_) } @alphabet;
print "\n";
The results are returned in order, as soon as they become available.
I'm the author of Parallel::WorkUnit so I'm partial to it. And I thought adding ordered responses was actually a great idea. It does it with forks, not threads, because forks are more widely supported and they often perform better in Perl.
my $wu = Parallel::WorkUnit->new();
for my $ascii (@alphabet) {
    $wu->async(sub { return chr($ascii); });
}
@output = $wu->waitall();
If you want to limit the number of simultaneous processes:
my $wu = Parallel::WorkUnit->new(max_children => 5);
for my $ascii (@alphabet) {
    $wu->queue(sub { return chr($ascii); });
}
@output = $wu->waitall();

Goroutines are cooperatively scheduled. Does that mean that goroutines that don't yield execution will cause goroutines to run one by one?

From: http://blog.nindalf.com/how-goroutines-work/
As the goroutines are scheduled cooperatively, a goroutine that loops continuously can starve other goroutines on the same thread.
Goroutines are cheap and do not cause the thread on which they are multiplexed to block if they are blocked on
network input
sleeping
channel operations or
blocking on primitives in the sync package.
So given the above, say that you have some code like this that does nothing but loop a random number of times and print the sum:
func sum(x int) {
sum := 0
for i := 0; i < x; i++ {
sum += i
}
fmt.Println(sum)
}
if you use goroutines like
go sum(100)
go sum(200)
go sum(300)
go sum(400)
will the goroutines run one by one if you only have one thread?
A compilation and tidying of all of creker's comments.
Preemptive means that the kernel (or runtime) lets a thread run for a specific amount of time and then hands execution to another thread, without the threads doing or knowing anything. In OS kernels that is usually implemented with hardware interrupts, so a process can't block the entire OS. In cooperative multitasking, a thread has to explicitly yield execution to others. If it doesn't, it can block the whole process or even the whole machine. That's how Go does it: there are some very specific points where a goroutine can yield execution. But if a goroutine just executes for {}, it will lock up the entire process.
However, the quote doesn't mention recent changes in the runtime. fmt.Println(sum) could cause other goroutines to be scheduled, as newer runtimes call the scheduler on function calls.
If you don't have any function calls, just some math, then yes, the goroutine will lock the thread until it exits or hits something that could yield execution to others. That's why for {} doesn't work in Go. Even worse, it will still lead to the process hanging even if GOMAXPROCS > 1 because of how the GC works, but in any case you shouldn't depend on that. It's good to understand that stuff, but don't count on it. There is even a proposal to insert scheduler calls in loops like yours.
The main thing that Go's runtime does is try its best to let every goroutine execute and not starve any of them. How it does that is not specified in the language specification and might change in the future. If the proposal about loops is implemented, then switching could occur even without function calls. At the moment, the only thing you should remember is that in some circumstances function calls could cause a goroutine to yield execution.
To explain the switching in Akavall's answer: when fmt.Printf is called, the first thing it does is check whether it needs to grow the stack, and it calls the scheduler. It MIGHT switch to another goroutine. Whether it will switch depends on the state of the other goroutines and the exact implementation of the scheduler. Like any scheduler, it probably checks whether there are starving goroutines that should be executed instead. With many iterations, a function call has a greater chance of causing a switch because the others have been starving longer. With few iterations, the goroutine finishes before starvation happens.
For what it's worth, I can produce a simple example where it is clear that the goroutines are not run one by one:
package main
import (
"fmt"
"runtime"
)
func sum_up(name string, count_to int, print_every int, done chan bool) {
my_sum := 0
for i := 0; i < count_to; i++ {
if i % print_every == 0 {
fmt.Printf("%s working on: %d\n", name, i)
}
my_sum += 1
}
fmt.Printf("%s: %d\n", name, my_sum)
done <- true
}
func main() {
runtime.GOMAXPROCS(1)
done := make(chan bool)
const COUNT_TO = 10000000
const PRINT_EVERY = 1000000
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
<- done
<- done
}
Result:
....
Amy working on: 7000000
Brian working on: 8000000
Amy working on: 8000000
Amy working on: 9000000
Brian working on: 9000000
Brian: 10000000
Amy: 10000000
Also if I add a function that just does a forever loop, that will block the entire process.
func dumb() {
for {
}
}
This blocks at some random point:
go dumb()
go sum_up("Amy", COUNT_TO, PRINT_EVERY, done)
go sum_up("Brian", COUNT_TO, PRINT_EVERY, done)
Well, let's say runtime.GOMAXPROCS is 1. The goroutines run concurrently one at a time. Go's scheduler just gives the upper hand to one of the spawned goroutines for a certain time, then to another, etc until all are finished.
So, you never know which goroutine is running at a given time, that's why you need to synchronize your variables. From your example, it's unlikely that sum(100) will run fully, then sum(200) will run fully, etc
The most probable is that one goroutine will do some iterations, then another will do some, then another again etc.
So, the overall is that they are not sequential, even if there is only one goroutine active at a time (GOMAXPROCS=1).
So, what's the advantage of using goroutines ? Plenty. It means that you can just do an operation in a goroutine because it is not crucial and continue the main program. Imagine an HTTP webserver. Treating each request in a goroutine is convenient because you do not have to care about queueing them and run them sequentially: you let Go's scheduler do the job.
Plus, sometimes goroutines are inactive, because you called time.Sleep, or they are waiting for an event, like receiving something from a channel. Go can see this and just executes other goroutines while some are in those idle states.
I know there are a handful of advantages I didn't present, but I don't know concurrency that much to tell you about them.
EDIT:
Related to your example code, if you add each iteration at the end of a channel, run that on one processor and print the content of the channel, you'll see that there is no context switching between goroutines: Each one runs sequentially after another one is done.
However, it is not a general rule and is not specified in the language. So, you should not rely on these results for drawing general conclusions.
@Akavall Try adding a sleep after creating the dumb goroutine; the Go runtime then never executes the sum_up goroutines.
From that, it looks like the Go runtime starts the next goroutines immediately, and it might execute a sum_up goroutine until the runtime schedules the dumb() goroutine to run. Once dumb() is scheduled, the runtime won't schedule the sum_up goroutines again, because dumb runs for {}.

How can I improve performance with FutureTasks

The problem seems simple: I have a (huge) number of operations that I need to run, and the main thread can only proceed when all of those operations have returned their results. I tried it in a single thread, where each operation took from about 2 to 10 seconds at most, and the whole run took about 2.5 minutes. Then I tried future tasks and submitted them all to the ExecutorService. All of them were processed at the same time, but each of them took from about 40 to 150 seconds. At the end of the day the full process took about 2.1 minutes.
If I'm right, the threads were nothing but a way to execute everything at once while sharing the processor's power. What I thought I would get was the processor working heavily to execute all the tasks at the same time, with each task taking the same time it takes in a single thread.
Question is: Is there a way I can reach this? (Maybe not with future tasks, maybe with something else, I don't know.)
Detail: I don't need them to run at exactly the same time; that doesn't matter to me. What really matters is the performance.
You might have created way too many threads. As a consequence, the cpu was constantly switching between them thus generating a noticeable overhead.
You probably need to limit the number of running threads and then you can simply submit your tasks that will execute concurrently.
Something like:
ExecutorService es = Executors.newFixedThreadPool(8);
List<Future<?>> futures = new ArrayList<>(runnables.size());
for (Runnable r : runnables) {
    // keep the Future returned by submit so we can wait on it below
    futures.add(es.submit(r));
}
// wait until they all finish:
for (Future<?> f : futures) {
    f.get();
}
// all done

How is 999µs too short but 1000µs just right?

When I run the following code, I get some output:
use std::thread::Thread;
static DELAY: i64 = 1000;
fn main() {
Thread::spawn(move || {
println!("some output");
});
std::io::timer::sleep(std::time::duration::Duration::microseconds(DELAY));
}
But if I set DELAY to 999, I get nothing. I think that 999 and 1000 are close enough not to cause such a difference, meaning there must be something else going on here. I've tried also with Duration::nanoseconds (999_999 and 1_000_000), and I see the same behavior.
My platform is Linux and I can reproduce this behavior nearly all the time: using 999 results in some output in way less than 1% of runs.
As a sidenote, I am aware that this approach is wrong.
The sleep function sleeps in increments of 1 millisecond, and if the number of milliseconds is less than 1 it does not sleep at all. Here is the relevant excerpt from the code:
pub fn sleep(&mut self, duration: Duration) {
// Short-circuit the timer backend for 0 duration
let ms = in_ms_u64(duration);
if ms == 0 { return }
self.inner.sleep(ms);
}
In your code, 999 microseconds made it not sleep at all, and the main thread ended before the spawned thread could print its output. With 1000 microseconds, i.e. 1 millisecond, the main thread slept, giving the spawned thread a chance to run.
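The question's code uses pre-1.0 APIs (`Thread::spawn`, `std::io::timer::sleep`) that are no longer part of the standard library. As a hedged sketch of the approach that does not depend on timing at all, here is the same idea in current Rust: keep the `JoinHandle` and `join` it, so the main thread waits exactly as long as the spawned thread needs. The function name `spawn_and_join` and the 999 µs of simulated work are made up for illustration.

```rust
use std::thread;
use std::time::Duration;

/// Spawns a thread and waits for it with `join`, returning what it produced.
fn spawn_and_join() -> String {
    let handle = thread::spawn(|| {
        // Simulated work; the duration is arbitrary.
        thread::sleep(Duration::from_micros(999));
        String::from("some output")
    });
    // `join` blocks until the spawned thread has finished, however long that takes.
    handle.join().expect("spawned thread panicked")
}

fn main() {
    println!("{}", spawn_and_join());
}
```

Because `join` blocks until the closure returns, the output no longer depends on how the sleep duration is rounded.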
Most probably your kernel is configured with a tick of 1000 Hz (one clock interrupt per millisecond). Perhaps you can improve on that by recompiling your kernel with a finer-grained clock, or as a tickless kernel, to allow finer clock resolution. The 1000 Hz clock tick is nowadays standard in Linux kernels running on PCs (and on most ARM and embedded Linux systems).
This is not a newbie issue, so perhaps you'll have to ask for local help to reconfigure and recompile your kernel to cope with more time resolution.

F# Asynch thread problem

I am learning F# and am very interested in this language.
I am trying to create async expressions that run asynchronously.
For example:
let prop1=async{
for i=0 to 1000000 do ()
MessageBox.Show("Done")
}
let prop2=async{
for i=0 to 1000000 do ()
MessageBox.Show("Done2")
}
Async.Start(prop1)
Async.Start(prop2)
When I run the program, the thread count of the process increases from 6 to 8. After I close the two message boxes, the process does not seem to destroy the threads it created: the count is still 8. What happened, or have I misunderstood F# async?
Thanks for your help.
The threads are taken from a thread pool (which is why there are more threads than actions, incidentally).
The pool exists until the application terminates.
Nothing to worry about.
Edit: For a nice in-depth article on F#, async and ThreadPool: http://www.voyce.com/index.php/2011/05/27/fsharp-async-plays-well-with-others/
The runtime might use a thread pool; that is, threads are not destroyed but wait for other asynchronous tasks. This technique helps the runtime reduce the time needed to start a new async operation, because creating a new thread consumes some time and resources.

Resources