I would like to perform the following processing on multiple threads using tokio or async-std. I have read the tutorials, but I haven't seen any mention of parallelizing a for loop. In my program, all threads refer to the same array but access different locations:
let input_array: Array2<f32>;
let mut output_array: Array2<f32>;

for idx in 0..loop_num {
    let res = do_some_func(&input_array, idx);
    output_array.slice_mut(s![idx, ..]).assign(&res);
}
I would like to change the for loop to use parallel processing.
Tokio and async-std deal with concurrency, not parallelism. If you need data parallelism, rayon is a better choice. If you are working with an iterator, the .chunks() method is a good fit; for a more imperative approach, you can use .par_chunks_mut().
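As a minimal sketch of the question's loop with rayon (the do_some_func body and the array shapes below are made-up stand-ins for the question's code): compute each result in parallel, since the closures only share &input_array, then assign the rows back in order on a single thread.

use ndarray::{s, Array1, Array2};
use rayon::prelude::*;

// Hypothetical stand-in for the question's do_some_func: one output row per index.
fn do_some_func(input: &Array2<f32>, idx: usize) -> Array1<f32> {
    input.row(idx).mapv(|x| x * 2.0)
}

fn main() {
    let loop_num: usize = 4;
    let input_array = Array2::<f32>::ones((loop_num, 3));
    let mut output_array = Array2::<f32>::zeros((loop_num, 3));

    // Run every iteration in parallel; each closure only reads from input_array.
    // Rayon's collect() preserves the original index order.
    let results: Vec<Array1<f32>> = (0..loop_num)
        .into_par_iter()
        .map(|idx| do_some_func(&input_array, idx))
        .collect();

    // Write the rows back single-threaded, preserving order.
    for (idx, res) in results.into_iter().enumerate() {
        output_array.slice_mut(s![idx, ..]).assign(&res);
    }

    println!("{:?}", output_array);
}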
I was thinking about using Rayon's parallel iterator feature, but I'm concerned about performance for iterating over small collections.
Parallelism overhead can sometimes cause a slowdown on small collections: iterating over 2 elements will be slower with the preparations needed for multi-threading than with a single-threaded version, while with 40 million elements parallelism should give close to a linear performance improvement.
I read about ParallelIterator::weight (0.6.0), but I don't know whether I should optimize such corner cases for small collections myself, or whether Rayon is smart enough to handle everything under the hood.
if collection_is_small() {
    // Run single-threaded version...
} else {
    // Use parallel iterator.
}
The ParallelIterator::weight of a processed element is 1 (see the relevant documentation for a proper definition); in short, processing a single element is cheap.
Google sent me to an old documentation page: Weight was deprecated and removed as of version 0.8.0.
The Weight API was deprecated in favor of split length control.
By default, Rayon splits at every item, effectively making all of the computation parallel; this behavior can be configured via with_min_len:
Sets the minimum length of iterators desired to process in each thread. Rayon will not split any smaller than this length, but of course an iterator could already be smaller to begin with.
Producers like zip and interleave will use greater of the two minimums. Chained iterators and iterators inside flat_map may each use their own minimum length.
extern crate rayon; // 1.0.3
use rayon::prelude::*;
use std::thread;

fn main() {
    println!("Main thread: {:?}", thread::current().id());

    let ids: Vec<_> = (0..4)
        .into_par_iter()
        .with_min_len(4)
        .map(|_| thread::current().id())
        .collect();

    println!("Iterations: {:?}", ids);
}
Output:
Main thread: ThreadId(0)
Iterations: [ThreadId(0), ThreadId(0), ThreadId(0), ThreadId(0)]
Playground (thanks to @Shepmaster for the code)
You can see empirically that such behavior is not guaranteed:
use rayon::prelude::*; // 1.0.3
use std::thread;

fn main() {
    let ids: Vec<_> = (0..2)
        .into_par_iter()
        .map(|_| thread::current().id())
        .collect();

    println!("{:?}", ids);
}
Various runs of the program show:
[ThreadId(1), ThreadId(2)]
[ThreadId(1), ThreadId(1)]
[ThreadId(2), ThreadId(1)]
[ThreadId(2), ThreadId(2)]
That being said, you should perform your own benchmarking. By default, Rayon creates a global thread pool and uses work stealing to balance the work between threads. The thread pool is a one-time setup cost per process, and work stealing helps ensure that work only crosses thread boundaries when needed, which is why some of the runs above show both items processed on the same thread.
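If you do want the explicit branch from the question's pseudo-code, a minimal sketch looks like this (the threshold value is made up; choose it from your own benchmarks):

use rayon::prelude::*;

// Hypothetical cut-off; tune it with your own measurements.
const PARALLEL_THRESHOLD: usize = 10_000;

fn lengths(lines: &[String]) -> Vec<usize> {
    if lines.len() < PARALLEL_THRESHOLD {
        // Small input: run single-threaded on the current thread.
        lines.iter().map(|s| s.len()).collect()
    } else {
        // Large input: let Rayon split the work across its thread pool.
        lines.par_iter().map(|s| s.len()).collect()
    }
}

fn main() {
    let data: Vec<String> = (0..100).map(|i| i.to_string()).collect();
    println!("{:?}", &lengths(&data)[..5]);
}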
I have a MyReader that implements Iterator and produces Buffers, where Buffer: Send. MyReader produces a lot of Buffers very quickly, but I have a CPU-intensive job to perform on each Buffer (.map(|buf| ...)) that is my bottleneck, and afterwards I gather the results (in order). I want to parallelize the CPU-intensive work - hopefully across N threads that use work stealing to get it done as fast as the number of cores allows.
Edit: To be more precise: I am working on rdedup. MyReader is a Chunker which reads io::Read (typically stdio), finds parts (chunks) of the data, and yields them. Then map() is supposed to, for each chunk, calculate its sha256 digest, compress, encrypt, and save it, and return the digest as the result of map(...). The digest of the saved data is used to build an index of the data. The order in which chunks are processed by map(...) does not matter, but the digest returned from each map(...) needs to be collected in the same order that the chunks were found. The actual save-to-file step is offloaded to yet another thread (the writer thread). actual code of the PR in question
I hoped I could use rayon for this, but rayon expects an iterator that is already parallelizable - e.g. a Vec<...> or something like that. I have found no way to get a par_iter from MyReader - my reader is very single-threaded in nature.
There is simple_parallel, but its documentation says it is not recommended for general use, and I want to make sure everything will just work.
I could just take an spmc queue implementation and a custom thread_pool, but I was hoping for an existing solution that is optimized and tested.
There is also pipeliner, but it doesn't support an ordered map yet.
In general, preserving order is a pretty tough requirement as far as parallelization goes.
You could try to hand-roll it with a typical fan-out/fan-in setup (a sketch follows the list):
a single producer which tags inputs with a sequential monotonically increasing ID,
a thread pool which consumes from this producer and then sends the result toward the final consumer,
a consumer which buffers and reorders results so as to process them in sequential order.
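A rough std-only sketch of that setup (the expensive() function is a made-up stand-in for the per-chunk digest/compress/encrypt work; a real implementation would likely use bounded channels and proper error handling):

use std::collections::BTreeMap;
use std::sync::mpsc::channel;
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical stand-in for the CPU-heavy per-chunk work.
fn expensive(chunk: u64) -> u64 {
    (0..1_000u64).fold(chunk, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
}

fn main() {
    let num_workers = 4;

    // Fan-out channel: producer -> workers (single receiver shared behind a mutex).
    let (work_tx, work_rx) = channel::<(usize, u64)>();
    let work_rx = Arc::new(Mutex::new(work_rx));

    // Fan-in channel: workers -> consumer.
    let (result_tx, result_rx) = channel::<(usize, u64)>();

    // Worker pool: pull a tagged chunk, process it, forward (id, result).
    let workers: Vec<_> = (0..num_workers)
        .map(|_| {
            let work_rx = Arc::clone(&work_rx);
            let result_tx = result_tx.clone();
            thread::spawn(move || loop {
                // The lock is only held while waiting for the next chunk.
                let msg = work_rx.lock().unwrap().recv();
                match msg {
                    Ok((id, chunk)) => {
                        let _ = result_tx.send((id, expensive(chunk)));
                    }
                    Err(_) => break, // producer hung up, no more work
                }
            })
        })
        .collect();
    drop(result_tx); // the consumer stops once every worker has exited

    // Producer: tag each input with a sequential, monotonically increasing ID.
    for (id, chunk) in (0u64..20).enumerate() {
        work_tx.send((id, chunk)).unwrap();
    }
    drop(work_tx);

    // Consumer: buffer out-of-order results and emit them in input order.
    let mut pending = BTreeMap::new();
    let mut next_id = 0;
    for (id, result) in result_rx {
        pending.insert(id, result);
        while let Some(result) = pending.remove(&next_id) {
            println!("#{}: {}", next_id, result);
            next_id += 1;
        }
    }

    for worker in workers {
        worker.join().unwrap();
    }
}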
Or you could raise the level of abstraction.
Of specific interest here: Future.
A Future represents the result of a computation which may or may not have happened yet. A consumer receiving an ordered list of Futures can simply wait on each one in turn and let buffering occur naturally in the queue.
For bonus points, if you use a fixed-size queue, you automatically get back-pressure on the producer.
And therefore I would recommend building something on top of CpuPool.
The setup is going to be:
use futures::Future; // futures 0.1
use futures_cpupool::CpuPool;
use std::sync::mpsc::{Receiver, Sender};

fn produce(sender: Sender<...>) {
    let pool = CpuPool::new_num_cpus();

    for chunk in reader {
        let future = pool.spawn_fn(|| /* do work */);
        sender.send(future);
    }

    // Dropping the sender signals to the consumer that there is no more work
}

fn consume(receiver: Receiver<...>) {
    while let Ok(future) = receiver.recv() {
        let item = future.wait().expect("Computation Error?");
        /* do something with item */
    }
}

fn main() {
    let (sender, receiver) = std::sync::mpsc::channel();

    std::thread::spawn(move || consume(receiver));

    produce(sender);
}
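As a side note on the fixed-size queue mentioned earlier: replacing std::sync::mpsc::channel() with std::sync::mpsc::sync_channel(bound) should give exactly that - send() blocks once `bound` futures are in flight, so the producer is throttled to the pace of the consumer.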
There is now a dpc-pariter crate. Simply replace iter.map(fn) with iter.parallel_map(fn), which will perform work in parallel while preserving result order. From the docs:
* drop-in replacement for standard iterators(*)
* preserves order
* lazy, somewhat like single-threaded iterators
* panic propagation
* support for iterating over borrowed values using scoped threads
* backpressure
* profiling methods (useful for analyzing pipelined processing bottlenecks)
Also, Rayon has an open issue with a great in-depth discussion of various implementation details and limitations.
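For reference, a minimal sketch of the dpc-pariter approach described above (the import path is assumed here to be pariter::IteratorExt - check the crate docs for the exact name - and the closure body is just a CPU-heavy stand-in):

use pariter::IteratorExt as _; // dpc-pariter; import path assumed, see the crate docs

fn main() {
    let digests: Vec<u64> = (0u64..10)
        .parallel_map(|chunk| {
            // CPU-heavy stand-in for the digest/compress/encrypt step.
            (0..1_000u64).fold(chunk, |acc, i| acc.wrapping_mul(31).wrapping_add(i))
        })
        .collect();

    // Results come back in the same order the inputs were produced.
    println!("{:?}", digests);
}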
I have two Futures, the second of which starts after the first has ended. Both write to the same ArrayBuffer instance, but since they are executed serially (not at the same time), I consider them not to be acting concurrently.
However, I know there is the @volatile annotation for variables shared among two or more threads (@volatile disables caching).
Since, after the first thread finishes, there might be some caching going on inside the ArrayBuffer instance that prevents the second thread from seeing the ArrayBuffer's real state, I am not sure whether it is safe to use ArrayBuffer this way.
Is it true that caching might be a problem in my situation, and if so, is there a recommended way to make ArrayBuffer use @volatile internally?
It should be fine iff (if-and-only-if) you propagate it [the array] through the future:
val futureA = Future {
  val buf = ArrayBuffer(…)
  update(buf)
  buf
}

val futureB = futureA map { buf =>
  moreUpdates(buf)
  buf
}

futureB foreach println // print the result of the transformations
This is OK from a memory-safety point of view because the completion of futureA happens-before the invocation of the onComplete callback (virtually all transformations on Future are implemented on top of onComplete), in this case map.
The problem is not caching, per se, but the fact that an ArrayBuffer is a composite, with several subfields that have to be updated in concert to assure correct operation. You will need to use thread synchronization tools to ensure this.
class ArrayBufferWrapper[T](ab: ArrayBuffer[T]) {
  def add(item: T): Unit = {
    this.synchronized {
      ab += item
    }
  }
}
By wrapping the ArrayBuffer, its internal components are safely published to the current thread, and you ensure thread-safe add operations.
No, it is not safe.
This is exactly the reason why they invented functional programming. If you are using Scala anyway, you might as well take advantage of the paradigm it offers.
Avoid using mutable structures, or, at least, in the rare cases when you have to use them, do not let them escape the local scope. Then you won't ever have to deal with problems like this; they just will not exist anymore.
Tell us more about what you are trying to do, and I am sure someone will suggest a design or two that does not involve two threads mutating the same structure.
I know a little about multi-threading with futures, for example:
for (i <- 1 to 5) yield future {
  println(i)
}
but here all the threads do the same work.
So, I want to know how to make two threads that do different work concurrently.
Also, I want to know whether there is any way to tell when all the threads have completed.
Please, give me something simple.
First of all, chances are you might be happy with parallel collections, especially if all you need is to crunch some data in parallel using multiple threads:
val lines = Seq("foo", "bar", "baz")
lines.par.map(line => line.length)
While parallel collections are suitable for finite datasets, Futures are more oriented towards event-like processing. In fact, a future defines a task, abstracting away from the execution details (one thread, multiple threads, how a particular task is pinned to a thread) -- all of this is controlled with an execution context. What you can do with a future is add callbacks (on success, on failure, on both), compose it with another future, or await its result. All these concepts are nicely explained in the official documentation, which is worth reading.
It seems to be a common idiom in Rust to spawn off a thread for blocking IO so you can use non-blocking channels:
use std::sync::mpsc::channel;
use std::thread;
use std::net::TcpListener;

fn main() {
    let (accept_tx, accept_rx) = channel();

    let listener_thread = thread::spawn(move || {
        let listener = TcpListener::bind(":::0").unwrap();
        for client in listener.incoming() {
            if let Err(_) = accept_tx.send(client.unwrap()) {
                break;
            }
        }
    });
}
The problem is, rejoining threads like this depends on the spawned thread "realizing" that the receiving end of the channel has been dropped (i.e., calling send(..) returns Err(_)):
drop(accept_rx);
listener_thread.join(); // blocks until listener thread reaches accept_tx.send(..)
You can make dummy connections to TcpListeners, and shut down TcpStreams via a clone, but these seem like really hacky ways to clean up such threads, and as it stands, I don't even know of a hack to get a thread that is blocked on reading from stdin to join.
How can I clean up threads like these, or is my architecture just wrong?
One simply cannot safely cancel a thread reliably in Windows or Linux/Unix/POSIX, so it isn't available in the Rust standard library.
Here is an internals discussion about it.
There are a lot of unknowns that come from cancelling threads forcibly. It can get really messy. Beyond that, the combination of threads and blocking I/O will always face this issue: you need every blocking I/O call to have timeouts for it to even have a chance of being interruptible reliably. If one can't write async code, one needs to either use processes (which have a defined boundary and can be ended by the OS forcibly, but obviously come with heavier weight and data sharing challenges) or non-blocking I/O which will land your thread back in an event loop that is interruptible.
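To make the question's accept loop interruptible without going fully async, one sketch (not from the answer, and using only std) is to poll a non-blocking listener together with a stop flag - the "every blocking call needs a timeout" idea in practice:

use std::io;
use std::net::TcpListener;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() -> io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    listener.set_nonblocking(true)?; // accept() now returns WouldBlock instead of blocking

    let stop = Arc::new(AtomicBool::new(false));
    let stop_flag = Arc::clone(&stop);

    let listener_thread = thread::spawn(move || {
        while !stop_flag.load(Ordering::Relaxed) {
            match listener.accept() {
                Ok((_stream, addr)) => println!("accepted {}", addr),
                Err(ref e) if e.kind() == io::ErrorKind::WouldBlock => {
                    // Nothing pending: sleep briefly so the loop can notice `stop`.
                    thread::sleep(Duration::from_millis(50));
                }
                Err(e) => {
                    eprintln!("accept error: {}", e);
                    break;
                }
            }
        }
    });

    // Later, when shutting down: set the flag and join deterministically.
    stop.store(true, Ordering::Relaxed);
    listener_thread.join().unwrap();
    Ok(())
}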
mio is available for async code. Tokio is a higher-level crate based on mio that makes writing non-blocking, async code even more straightforward.