I'm trying to speed up CPU-heavy computations in an async environment (tokio). My initial solution was to use rayon::spawn with a parallel iterator inside it to actually perform the work. A minimal example in pseudo-code:
for data in vec![data1, data2, data3] {
    rayon::spawn(move || {
        iproduct!(...)
            .par_bridge()
            .for_each(|...| process(data))
    })
}
This must be wrong, because as written above all tasks take 2h to run, but if I remove .par_bridge() then the first task takes only 20min (as expected, since the first set of data is much smaller) while the rest take upwards of 4h.
I hope this isn't a case of the XY problem. What am I doing wrong?
I have some long-running threads that are fed by a Deque, which has data pushed into it by another long-running thread. Currently I'm using std::thread::spawn, and I have to wrap the Deque in an Arc<> to share it between the threads. If I use &deque, I run into the classic 'static lifetime issue, hence the Arc<>. I've looked at scoped threads, but the closure the threads run won't return for a very long time, so I don't think that will work for this case. Is anyone aware of an alternative solution, short of using unsafe?

I'm not satisfied with the Arc<> solution: each time I touch the Deque, the code digs into the Arc<>'s inner value to get to the Deque, incurring overhead I'd like to avoid. I've also considered making the Deque static, but it would need to be a lazy static due to the allocation restrictions on statics, and that comes with its own access overhead.
You can get a &Deque out of the Arc<Deque> just once at the beginning of your long-running thread and keep using that immutable reference throughout its life. Something like this:
let dq: Arc<Deque<T>> = ....;
....
{
    let dq2 = Arc::clone(&dq);
    thread::spawn(move || {
        // Deref the Arc once and keep the plain reference from then on.
        let dq_ref: &Deque<T> = &*dq2;
        // long-running calculation using dq_ref
        // dq2 is dropped when the closure returns
    });
}
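For a self-contained illustration of that pattern, here is a sketch that substitutes a Mutex-protected VecDeque for the question's Deque and pre-fills the queue instead of feeding it from a second long-running thread (the concrete deque type and the items are assumptions, not from the original code):

use std::collections::VecDeque;
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    // Stand-in for the question's Deque: a Mutex-protected VecDeque.
    let dq: Arc<Mutex<VecDeque<u32>>> = Arc::new(Mutex::new(VecDeque::new()));
    dq.lock().unwrap().extend(0..10);

    let dq2 = Arc::clone(&dq);
    let worker = thread::spawn(move || {
        // Dereference the Arc once; use the plain reference from here on.
        let dq_ref: &Mutex<VecDeque<u32>> = &*dq2;
        loop {
            let item = dq_ref.lock().unwrap().pop_front();
            match item {
                Some(v) => println!("processing {v}"),
                None => break, // queue drained; a real worker would block or park
            }
        }
        // dq2 is dropped here, releasing its share of the Arc.
    });

    worker.join().unwrap();
}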
The following blocks each run a loop assigning the topic variable ($_) to a variable $var:
In the first, the my $var; is outside the loop.
In the second, the my $var; is inside the loop.
Lastly, a state $var; is inside the loop.
my $limit = 10_000_000;
{
    my $var;
    for ^$limit { $var = $_; }
    say now - ENTER now;
}
{
    for ^$limit { my $var; $var = $_; }
    say now - ENTER now;
}
{
    for ^$limit { state $var; $var = $_; }
    say now - ENTER now;
}
Sample output durations (in seconds) for each block are as follows:
0.5938845
1.8251226
2.60700803
The docs at https://docs.perl6.org/syntax/state mention that state variables have the same lexical scoping as my. Functionally, block 1 and block 3 would achieve the same persistent storage across multiple calls to the respective loop block.
Why does the state (and the inner my) version take so much more time? What else is it doing?
Edit:
Similar to #HåkonHægland's comment, if I cut and paste the above code so that each block runs three times in total, the timing changes significantly for the my $var outside the loop (the first case):
0.600303
1.7917011
2.6640811
1.67793597
1.79197091
2.6816156
1.795679
1.81233942
2.77486777
Short version: in a world without any runtime optimization (type specialization, JIT, and so forth), the timings would match your expectations. The timings here are influenced by how well the optimizer deals with each example.
First of all, it's interesting to run the code without any kind of runtime optimization. In my (rather slow) VM on the box I'm currently on, sticking MVM_SPESH_DISABLE=1 into the environment results in these timings:
13.92366942
16.235372
14.4329288
These make some kind of intuitive sense:
In the first case, we have a simple lexical variable declared in the outer scope of the block
In the second case, we have to allocate, and then garbage collect, an extra Scalar every time around the loop, which accounts for the extra time
In the third case, we're using the state variable. A state variable is stored in the code object of the closure, and then copied into the call frame at entry time. That's cheaper than allocating a new Scalar every time, but still a little bit more work than not having to do that operation at all.
Next, let's run 3 programs with the optimizer enabled, each example in its own isolated program.
The first comes out at 0.86298831, a factor of 16 faster. Go optimizer! It has inlined the loop body.
The second comes out at 1.2288566, a factor of 13 faster. Not too shabby either. It has again inlined the loop body. (This case will also become rather cheaper in the future, once the escape analyzer is smart enough to eliminate the Scalar allocation.)
The third comes out at 2.0695035, a factor of 7 faster. That's comparatively unimpressive (even if still quite an improvement), and the major reason is that it has not inlined the loop body. Why? Because it doesn't know how to inline code that uses state variables yet. (How to see this: run with MVM_SPESH_INLINE_LOG=1 in the environment, and among the output is: Can NOT inline (1) with bytecode size 78 into (3): cannot inline code that declares a state variable.)
In short, the dominating factor here is the inlining of the loop body, and with state variables that is presently not possible.
It's not immediately clear why the optimizer does worse at the case with the outer declaration of $var when that isn't the first loop in the program; that feels more like a bug than a reasonable case of "this feature isn't optimized well yet". In its slight defense, it still consistently manages to deliver a big improvement, even when not so big as might be desired!
I have a MyReader that implements Iterator and produces Buffers, where Buffer : Send. MyReader produces a lot of Buffers very quickly, but I have a CPU-intensive job to perform on each Buffer (.map(|buf| ...)) that is my bottleneck, after which I gather the results (in order). I want to parallelize the CPU-intensive work, hopefully across N threads that would use work stealing to perform it as fast as the number of cores allows.
Edit: To be more precise, I am working on rdedup. MyStruct is Chunker, which reads io::Read (typically stdio), finds parts (chunks) of data and yields them. Then map() is supposed, for each chunk, to calculate its sha256 digest, compress, encrypt, save, and return the digest as the result of map(...). The digest of the saved data is used to build an index of the data. The order in which chunks are processed by map(...) does not matter, but the digest returned from each map(...) needs to be collected in the same order that the chunks were found. The actual save-to-file step is offloaded to yet another thread (the writer thread). actual code of PR in question
I hoped I could use rayon for this, but rayon expects an iterator that is already parallelizable, e.g. a Vec<...> or something like that. I have found no way to get a par_iter out of MyReader; my reader is very single-threaded in nature.
There is simple_parallel, but the documentation says it's not recommended for general use, and I want to make sure everything will just work.
I could just take an spmc queue implementation and a custom thread_pool, but I was hoping for an existing solution that is optimized and tested.
There's also pipeliner, but it doesn't support ordered map yet.
In general, preserving order is a pretty tough requirement as far as parallelization goes.
You could try to hand-make it with a typical fan-out/fan-in setup (sketched after this list):
a single producer which tags inputs with a sequential monotonically increasing ID,
a thread pool which consumes from this producer and then sends the result toward the final consumer,
a consumer which buffers and reorders results so as to treat them in sequential order.
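A rough, self-contained sketch of that hand-made setup, using only std threads and channels (the process function, the input range, and the worker count are illustrative assumptions, not code from the question):

use std::collections::HashMap;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Placeholder for the CPU-intensive work done on each input.
fn process(chunk: u64) -> u64 {
    chunk * chunk
}

fn main() {
    let workers = 4;

    // Producer -> workers: (sequence id, input). The receiver is shared behind
    // a Mutex so the workers pull items from a single queue.
    let (in_tx, in_rx) = mpsc::channel::<(u64, u64)>();
    let in_rx = Arc::new(Mutex::new(in_rx));

    // Workers -> consumer: (sequence id, result).
    let (out_tx, out_rx) = mpsc::channel::<(u64, u64)>();

    // Fan-out: a small pool of worker threads.
    let mut handles = Vec::new();
    for _ in 0..workers {
        let in_rx = Arc::clone(&in_rx);
        let out_tx = out_tx.clone();
        handles.push(thread::spawn(move || loop {
            let msg = in_rx.lock().unwrap().recv();
            match msg {
                Ok((id, chunk)) => out_tx.send((id, process(chunk))).unwrap(),
                Err(_) => break, // producer hung up: no more work
            }
        }));
    }
    drop(out_tx); // the consumer loop ends once every worker is done

    // Producer: tag each input with a monotonically increasing id.
    for (id, chunk) in (0u64..20).enumerate() {
        in_tx.send((id as u64, chunk)).unwrap();
    }
    drop(in_tx);

    // Fan-in: buffer out-of-order results and emit them in sequence.
    let mut pending: HashMap<u64, u64> = HashMap::new();
    let mut next = 0u64;
    for (id, result) in out_rx {
        pending.insert(id, result);
        while let Some(result) = pending.remove(&next) {
            println!("#{next}: {result}");
            next += 1;
        }
    }

    for handle in handles {
        handle.join().unwrap();
    }
}

Swapping the unbounded channels for std::sync::mpsc::sync_channel would bound how far the producer can run ahead of the consumer.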
Or you could raise the level of abstraction.
Of specific interest here: Future.
A Future represents the result of a computation, which may or may not have happened yet. A consumer receiving an ordered list of Future can simply wait on each one, and let buffering occur naturally in the queue.
For bonus points, if you use a fixed-size queue, you automatically get back-pressure on the producer.
And therefore I would recommend building something on top of CpuPool.
The setup is going to be:
use futures::Future;
use futures_cpupool::CpuPool;
use std::sync::mpsc::{Receiver, Sender};

fn produce(sender: Sender<...>) {
    let pool = CpuPool::new_num_cpus();

    for chunk in reader {
        let future = pool.spawn_fn(|| /* do work */);
        sender.send(future);
    }

    // Dropping the sender signals there's no more work to the consumer
}

fn consume(receiver: Receiver<...>) {
    while let Ok(future) = receiver.recv() {
        let item = future.wait().expect("Computation Error?");
        /* do something with item */
    }
}

fn main() {
    let (sender, receiver) = std::sync::mpsc::channel();

    std::thread::spawn(move || consume(receiver));

    produce(sender);
}
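As a concrete, runnable variant of that sketch (assuming the futures 0.1 and futures-cpupool crates this answer was written against; the squaring workload, the queue bound of 8, and the dependency versions in the comment are illustrative assumptions), a bounded sync_channel also gives the back-pressure mentioned above:

// Cargo.toml (assumed, matching the era of this answer): futures = "0.1",
// futures-cpupool = "0.1"
use futures::Future;
use futures_cpupool::{CpuFuture, CpuPool};
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    let pool = CpuPool::new_num_cpus();

    // A bounded queue: at most 8 futures in flight, so `send` blocks and
    // applies back-pressure to the producer when the consumer falls behind.
    let (sender, receiver) = sync_channel::<CpuFuture<u64, ()>>(8);

    let consumer = thread::spawn(move || {
        while let Ok(future) = receiver.recv() {
            let item = future.wait().expect("computation failed");
            println!("{item}");
        }
    });

    for chunk in 0u64..32 {
        let future = pool.spawn_fn(move || -> Result<u64, ()> {
            // Stand-in for the CPU-heavy per-chunk work.
            Ok(chunk * chunk)
        });
        sender.send(future).expect("consumer hung up");
    }
    drop(sender); // no more work: the consumer loop ends

    consumer.join().unwrap();
}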
There is now a dpc-pariter crate. Simply replace iter.map(fn) with iter.parallel_map(fn), which will perform the work in parallel while preserving result order (a usage sketch follows the list below). From the docs:
* drop-in replacement for standard iterators(*)
* preserves order
* lazy, somewhat like single-threaded iterators
* panic propagation
* support for iterating over borrowed values using scoped threads
* backpressure
* profiling methods (useful for analyzing pipelined processing bottlenecks)
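A minimal usage sketch, assuming the crate exposes parallel_map through an extension trait roughly like the import below (check the dpc-pariter docs for the exact path), with a placeholder expensive function standing in for the real per-item work:

// Hedged sketch: the exact extension-trait name/path below is an assumption
// based on the crate's README; check the dpc-pariter docs before relying on it.
use dpc_pariter::IteratorExt as _;

// Placeholder for the CPU-heavy per-chunk work (hashing, compression, ...).
fn expensive(buf: Vec<u8>) -> u64 {
    buf.iter().map(|&b| b as u64).sum()
}

fn main() {
    let chunks = vec![vec![1u8], vec![2, 3], vec![4, 5, 6]];

    // Same shape as `iter.map(fn)`, but the closures run on worker threads
    // while the results are yielded back in the original order.
    let digests: Vec<u64> = chunks.into_iter().parallel_map(expensive).collect();

    println!("{digests:?}");
}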
Also, Rayon has an open issue with a great in-depth discussion of various implementation details and limitations.
I know a little about multithreading with futures, such as:
for (i <- 1 to 5) yield future {
  println(i)
}
But in this case all the threads do the same work.
So I want to know how to make two threads that do different work concurrently.
Also, is there any method to know when all the threads are complete?
Please, give me something simple.
First of all, chances are you might be happy with parallel collections, especially if all you need is to crunch some data in parallel using multiple threads:
val lines = Seq("foo", "bar", "baz")
lines.par.map(line => line.length)
While parallel collections are suitable for finite datasets, Futures are more oriented towards event-like processing, and in fact a Future defines a task, abstracting away from execution details (one thread, multiple threads, how a particular task is pinned to a thread) -- all of this is controlled with an execution context. What you can do with futures, though, is add callbacks (on success, on failure, on both), compose them with other futures, or await the result. All these concepts are nicely explained in the official doc, which is worth reading.
Today, I got into multi-threading. Since it's a new concept, I thought I could begin to learn by translating a simple iteration to a parallelized one. But, I think I got stuck before I even began.
Initially, my loop looked something like this:
let stuff: Vec<u8> = items.into_iter().map(|item| {
    some_item_worker(&item)
}).collect();
I had put a reasonably large amount of stuff into items and it took about 0.05 seconds to finish the computation. So, I was really excited to see the time reduction once I successfully implemented multi-threading!
When I used threads, I got into trouble, probably due to my bad reasoning.
use std::thread;

let threads: Vec<_> = items.into_iter().map(|item| {
    thread::spawn(move || {
        some_item_worker(&item)
    })
}).collect(); // yeah, this is followed by another iter() that unwraps the values
I have a quad-core CPU, which means that I can run only up to 4 threads concurrently. I guessed that it worked this way: once the iterator starts, threads are spawned. Whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently.
The result was that it took (after some re-runs) ~0.2 seconds to finish the same computation. Clearly, there's no parallel computing going on here. I don't know why the time increased by 4 times, but I'm sure that I've misunderstood something.
Since this isn't the right way, how should I go about modifying the program so that the threads execute concurrently?
EDIT:
I'm sorry, I was wrong about that ~0.2 seconds. I woke up and tried it again, and noticed that the usual iteration ran for 2 seconds. It turned out that some process had been consuming memory wildly. When I rebooted my system and tried the threaded iteration again, it ran for about 0.07 seconds. Here are some timings for each run.
Actual iteration (first one):
0.0553760528564 seconds
0.0539519786835 seconds
0.0564560890198 seconds
Threaded one:
0.0734670162201 seconds
0.0727820396423 seconds
0.0719120502472 seconds
I agree that the threads are indeed running concurrently, but it seems to take another 20 ms to finish the job. My actual goal was to utilize my processor to run threads in parallel and finish the job sooner. Is this going to be complicated? What should I do to make those threads run in parallel, not merely concurrently?
I have a quad-core CPU, which means that I can run only up to 4 threads concurrently.
Only 4 may be running concurrently, but you can certainly create more than 4...
whenever a thread ends, another thread begins, so that at any given time, 4 threads run concurrently (it was just a guess).
Whenever you have a guess, you should create an experiment to figure out if your guess is correct. Here's one:
use std::{thread, time::Duration};
fn main() {
let threads: Vec<_> = (0..500)
.map(|i| {
thread::spawn(move || {
println!("Thread #{i} started!");
thread::sleep(Duration::from_millis(500));
println!("Thread #{i} finished!");
})
})
.collect();
for handle in threads {
handle.join().unwrap();
}
}
If you run this, you will see that "Thread XX started!" is printed out 500 times, followed by 500 "Thread XX finished!"
Clearly, there's no parallel computing going on here
Unfortunately, your question isn't fleshed out enough for us to tell why your time went up. In the example I've provided, it takes a little less than 600 ms, so it's clearly not happening in serial!
Creating a thread has a cost. If the cost of the computation inside the thread is small enough, it'll be dwarfed by the cost of the threads or the inefficiencies caused by the threads.
For example, spawning 10 million threads to double 10 million u8s will probably not be worth it. Vectorizing it would probably yield better performance.
That said, you still might be able to get some improvement by parallelizing cheap tasks, but you want to use fewer threads, either through a thread pool with a small number of threads (so only a small number of threads exists at any given point, meaning less CPU contention), or through something more sophisticated under the hood whose API is still quite simple, like Rayon.
use rayon::prelude::*;

// Notice `.par_iter()` turns it into a `parallel iterator`
let stuff: Vec<u8> = items.par_iter().map(|item| {
    some_item_worker(&item)
}).collect();
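As a complete, runnable sketch of that idea (the items vector and the body of some_item_worker here are placeholders, not the question's actual workload):

use rayon::prelude::*;

// Stand-in for the actual CPU-heavy per-item work.
fn some_item_worker(item: &u32) -> u8 {
    (item % 251) as u8
}

fn main() {
    let items: Vec<u32> = (0..1_000_000).collect();

    // Rayon runs the closures on a work-stealing pool sized to the number of
    // logical cores; `collect` still returns results in the input order.
    let stuff: Vec<u8> = items.par_iter().map(some_item_worker).collect();

    println!("{}", stuff.len());
}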