Run function over range in multiple threads

I have a function,
fn calculate(x: i64) -> i64 {
// do stuff
x
}
which I want to apply to a range
for i in 0..100 {
calculate(i);
}
I want to multithread this, though. I've tried different things: having an atomic i would be a good idea, but then I'd have to get into the details of shared ownership, pull in libraries, and so on. Is there a simple way of doing this?

If you just want to run stuff on multiple threads and don't really care about the specifics, rayon might be helpful:
use rayon::prelude::*;
fn calculate(x: i64) -> i64 {
x
}
fn main() {
let results = (0..100i64)
.into_par_iter()
.map(calculate)
.collect::<Vec<i64>>();
println!("Results: {:?}", results);
}
This will automatically spin up threads based on how many cores you have and distribute work between them.
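If you only need an aggregate rather than every individual result, the same parallel iterator composes with reductions. A minimal sketch (rayon's sum() combines per-thread partial sums, so no intermediate Vec is needed):
use rayon::prelude::*;

fn calculate(x: i64) -> i64 {
    x
}

fn main() {
    // sum() reduces in parallel; each thread folds its share and
    // rayon combines the partial results
    let total: i64 = (0..100i64).into_par_iter().map(calculate).sum();
    println!("Total: {}", total);
}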

I'm not really certain what you want to achieve precisely, but here is a trivial example:
let th: Vec<_> = (0..4) // four threads will be launched
.map(|t| {
std::thread::spawn(move || {
let begin = t * 25;
let end = (t + 1) * 25;
for i in begin..end { // each one handles a quarter of the overall computation
let r = calculate(i);
println!("t={} i={} r={}", t, i, r);
}
})
})
.collect();
for t in th { // wait for the four threads to terminate
let _ = t.join();
}
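If you want the computed values back instead of just printing them, each thread can return its chunk and the handles can be joined in spawn order. A sketch keeping the same hard-coded four-way split:
fn calculate(x: i64) -> i64 {
    x
}

fn main() {
    let handles: Vec<_> = (0..4i64)
        .map(|t| {
            std::thread::spawn(move || {
                // each thread computes and returns its quarter
                (t * 25..(t + 1) * 25).map(calculate).collect::<Vec<i64>>()
            })
        })
        .collect();
    // join in spawn order and flatten into one Vec
    let results: Vec<i64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    println!("{:?}", results);
}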


Rust threadpool with init code in each thread?

The following code works; it can be tested in the Playground.
use std::{thread, time::Duration};
use rand::Rng;
fn main() {
let mut hiv = Vec::new();
let (sender, receiver) = crossbeam_channel::unbounded();
// make workers
for t in 0..5 {
println!("Make worker {}", t);
let receiver = receiver.clone(); // clone for this thread
let handler = thread::spawn(move || {
let mut rng = rand::thread_rng(); // each thread has its own RNG
loop {
let r = receiver.recv();
match r {
Ok(x) => {
let s = rng.gen_range(100..1000);
thread::sleep(Duration::from_millis(s));
println!("w={} r={} working={}", t, x, s);
},
_ => { println!("No more work for {} --- {:?}.", t, r); break},
}
}
});
hiv.push(handler);
}
// Generate jobs
for x in 0..10 {
sender.send(x).expect("all threads hung up :(");
}
drop(sender);
// wait for jobs to finish.
println!("Wait for all threads to finish.\n");
for h in hiv {
h.join().unwrap();
}
println!("join() done. Work Finish.");
}
My question is the following:
Can I remove the boilerplate code by using threadpool, rayon, or some other Rust crate?
I know that I could write my own implementation, but I would like to know whether there is a crate with the same functionality.
From my research, threadpool/rayon are useful when you "send" code and it is executed, but I have not found a way to make N threads that keep some code/logic/state of their own.
The basic idea is in let mut rng = rand::thread_rng(); this is an instance that each thread needs to have of its own.
Also, if there are other problems with the code, please point them out.
Yes, you can use Rayon to eliminate a lot of that code and make the remaining code much more readable, as illustrated in this gist:
https://gist.github.com/BillBarnhill/db07af903cb3c3edb6e715d9cedae028
The worker pool model is not great in Rust due to the ownership rules; as a result, parallel iterators are often a better choice.
I originally forgot to address your main concern, per-thread context. You can see how to store per-thread context using thread-local storage in this answer:
https://stackoverflow.com/a/42656422/204343
I will try to come back and edit the code to use thread-local storage as soon as I have more time.
The gist requires nightly because of thread_id_value, but that is all but stable and can be removed if needed.
The real catch is that the gist includes timing and compares main_new with main_original, with surprising results. Perhaps not so surprising: Rayon has good debug support.
On Debug build the timing output is:
main_new duration: 1.525667954s
main_original duration: 1.031234059s
You can see that main_new takes almost 50% longer to run.
On release, however, main_new is marginally faster:
main_new duration: 1.584190936s
main_original duration: 1.5851124s
A slimmed version of the gist is below, with only the new code.
#![feature(thread_id_value)]
use std::{thread, time::Duration, time::Instant};
use rand::Rng;
#[allow(unused_imports)]
use rayon::prelude::*;
fn do_work(x : u32) -> String {
let mut rng = rand::thread_rng(); // hands back this thread's own RNG
let s = rng.gen_range(100..1000);
let thread_id = thread::current().id();
let t = thread_id.as_u64();
thread::sleep(Duration::from_millis(s));
format!("w={} r={} working={}", t, x, s)
}
fn process_work_product(output : String) {
println!("{}", output);
}
fn main() {
// a bit hacky, but let's set the number of threads to 4
rayon::ThreadPoolBuilder::new()
.num_threads(4)
.build_global()
.unwrap();
let x = 0..10;
x.into_par_iter()
.map(do_work)
.for_each(process_work_product);
}
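As a sketch of the per-thread context mentioned above: the standard thread_local! macro keeps one lazily initialized value per thread. The names here are illustrative, and since rand::thread_rng() is itself already thread-local, the RNG merely stands in for whatever state your workers need to remember:
use std::cell::RefCell;
use rand::Rng;
use rayon::prelude::*;

thread_local! {
    // initialized once per thread, on first access from that thread
    static RNG: RefCell<rand::rngs::ThreadRng> =
        RefCell::new(rand::thread_rng());
}

fn do_work(x: u32) -> String {
    RNG.with(|rng| {
        let s = rng.borrow_mut().gen_range(100..1000);
        format!("x={} rolled={}", x, s)
    })
}

fn main() {
    let out: Vec<String> = (0..10u32).into_par_iter().map(do_work).collect();
    println!("{:?}", out);
}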

Conway's game of life becomes slower after using multi-threads

I am a complete beginner in Rust and I am currently writing this parallel Conway's Game of Life. The code itself works fine, but the problem is that the program becomes slower when using multiple threads (I measure the speed of the program by timing how long the glider takes to move from the top-left corner to the bottom-right corner). I did some experiments, and it became slower and slower as the number of threads increased. I also have a Java version using almost the same algorithm, and it works just fine. All I expected was that the Rust version would become at least slightly faster with more than one thread. Can anyone please point out what I did wrong? I'm sorry if the code seems unreasonable; as I said, I am a complete beginner :-).
main.rs reads the command line arguments and does the board update.
extern crate clap;
extern crate termion;
extern crate chrono;
use std::thread;
use std::sync::Arc;
use chrono::prelude::*;
mod board;
mod config;
use board::Board;
use config::Config;
fn main() {
let dt1 = Local::now();
let matches = clap::App::new("conway")
.arg(clap::Arg::with_name("length")
.long("length")
.value_name("LENGTH")
.help("Set length of the board")
.takes_value(true))
.arg(clap::Arg::with_name("threads")
.long("threads")
.value_name("THREADS")
.help("How many threads to update the board")
.takes_value(true))
.arg(clap::Arg::with_name("display")
.long("display")
.value_name("DISPLAY")
.help("Display the board or not")
.takes_value(true))
.arg(clap::Arg::with_name("delay")
.long("delay")
.value_name("MILLISECONDS")
.help("Delay between the frames in milliseconds")
.takes_value(true))
.get_matches();
let config = Config::from_matches(matches);
let mut board = Board::new(config.length);
let mut start: bool = false;
let mut end: bool = false;
let mut start_time: DateTime<Local> = Local::now();
let mut end_time: DateTime<Local>;
board.initialize_glider();
loop {
if config.display == 1 {
print!("{}{}", termion::clear::All, termion::cursor::Goto(3, 3));
board_render(&board);
}
if board.board[0][1] == 1 && !start {
start_time = Local::now();
start = true;
}
if board.board[config.length - 1][config.length - 1] == 1 && !end {
end_time = Local::now();
println!("{}", end_time - start_time);
end = true;
}
board = board::Board::update(Arc::new(board), config.threads);
thread::sleep(config.delay);
}
}
fn board_render(board: &Board) {
let mut output = String::with_capacity(board.n * (board.n + 1));
for i in 0..board.n {
for j in 0..board.n {
let ch;
if board.board[i][j] == 0 {
ch = '░';
} else {
ch = '█';
}
output.push(ch);
}
output.push_str("\n ");
}
print!("{}", output);
}
board.rs is where the algorithm for updating the board with multiple threads lives:
use std::sync::{Mutex, Arc};
use std::thread;
pub struct Board {
pub n: usize,
pub board: Vec<Vec<i32>>,
}
impl Board {
pub fn new(n: usize) -> Board {
let board = vec![vec![0; n]; n];
Board {
n,
board,
}
}
pub fn update(Board: Arc<Self>, t_num: usize) -> Board {
let new_board = Arc::new(Mutex::new(Board::new(Board.n)));
let mut workers = Vec::with_capacity(t_num);
let block_size = Board.n / t_num;
let mut start = 0;
for t in 0..t_num {
let old_board = Board.clone();
let new_board = Arc::clone(&new_board);
let mut end = start + block_size;
if t == t_num - 1 { end = old_board.n; }
let worker = thread::spawn(move || {
let mut board = new_board.lock().unwrap();
for i in start..end {
for j in 0..old_board.n {
let im = (i + old_board.n - 1) % old_board.n;
let ip = (i + 1) % old_board.n;
let jm = (j + old_board.n - 1) % old_board.n;
let jp = (j + 1) % old_board.n;
let sum = old_board.board[im][jm] + old_board.board[im][j]
+ old_board.board[im][jp] + old_board.board[i][jm] + old_board.board[i][jp]
+ old_board.board[ip][jm] + old_board.board[ip][j] + old_board.board[ip][jp];
if sum == 2 {
board.board[i][j] = old_board.board[i][j];
} else if sum == 3 {
board.board[i][j] = 1;
} else {
board.board[i][j] = 0;
}
}
}
});
workers.push(worker);
start = start + block_size;
}
for worker in workers {
worker.join().unwrap();
}
let result = new_board.lock().unwrap();
let mut board = Board::new(Board.n);
board.board = result.board.to_vec();
board
}
pub fn initialize_glider(&mut self) -> &mut Board {
self.board[0][1] = 1;
self.board[1][2] = 1;
self.board[2][0] = 1;
self.board[2][1] = 1;
self.board[2][2] = 1;
self
}
}
Each worker thread tries to lock the mutex immediately upon starting, and never releases the lock until it's done. Since only one thread can hold the mutex at a time, only one thread can do work at a time.
Here are two ways you might solve this problem:
Don't lock the mutex until you really, really need to. Create a scratch area inside the worker thread that represents the block you are updating. Fill the scratch area first. Then lock the mutex, copy the contents of the scratch area into the new_board, and return (see the sketch after these two options).
Using this method, most of the work can be done concurrently, but if all your workers finish at roughly the same time they will still have to take turns copying their results into new_board.
Don't use a lock at all: change the type of self.board to Vec<Vec<AtomicI32>> (std::sync::atomic::AtomicI32) and atomically update the board without having to acquire a lock.
This method may or may not slow down the process of updating, possibly depending on what memory orderings you use¹, but it eliminates contention for the lock.
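Here is a minimal sketch of option 1, simplified to operate on a plain Vec<Vec<i32>> rather than the Board type (the function shape is mine, not the original code):
use std::sync::{Arc, Mutex};
use std::thread;

fn update(old: Arc<Vec<Vec<i32>>>, t_num: usize) -> Vec<Vec<i32>> {
    let n = old.len();
    let new = Arc::new(Mutex::new(vec![vec![0; n]; n]));
    let block = n / t_num;
    let workers: Vec<_> = (0..t_num)
        .map(|t| {
            let (old, new) = (Arc::clone(&old), Arc::clone(&new));
            let start = t * block;
            let end = if t == t_num - 1 { n } else { start + block };
            thread::spawn(move || {
                // all the real work happens in a private scratch
                // buffer, without holding the lock
                let mut scratch = vec![vec![0; n]; end - start];
                for i in start..end {
                    for j in 0..n {
                        let (im, ip) = ((i + n - 1) % n, (i + 1) % n);
                        let (jm, jp) = ((j + n - 1) % n, (j + 1) % n);
                        let sum = old[im][jm] + old[im][j] + old[im][jp]
                            + old[i][jm] + old[i][jp]
                            + old[ip][jm] + old[ip][j] + old[ip][jp];
                        scratch[i - start][j] = match sum {
                            2 => old[i][j],
                            3 => 1,
                            _ => 0,
                        };
                    }
                }
                // the lock is held only long enough to copy the block
                let mut guard = new.lock().unwrap();
                guard[start..end].clone_from_slice(&scratch);
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    Arc::try_unwrap(new).unwrap().into_inner().unwrap()
}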
Free-range advice
Don't call a variable Board. The convention, which the compiler alerts you to, is to give variables snake_case names, but beyond that it is confusing because you also have a type named Board. I suggest just calling it self, which also lets you call update with method syntax.
Don't put the whole board in an Arc so you can pass it to update, and then make a new board which has to be put in a new Arc the next iteration. Either make update return an Arc itself, or have it take self and do all the Arc-wrangling inside it.
Better still, don't use Arc at all. Use a crate that provides scoped threads to pass your data to the worker threads by reference (a sketch follows at the end of this answer).
Allocator performance will generally be better with a few large allocations than with many small ones. Change the type of Board.board to Vec<i32> and use arithmetic to calculate the indexes (for instance, point i, j is at index j*n + i).
It's also better not to create and throw away allocations if you don't need to. Typical advice for cellular automata is to create two buffers that contain board states: the current state and the next state. When you're done creating the next state, just swap the buffers so the current state becomes the next state and vice versa.
i32 wastes space; you could use i8 or an enum, or possibly bool.
¹ I would suggest SeqCst unless you really know what you're doing. I suspect Relaxed is probably sufficient, but I don't really know what I'm doing.
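For the scoped-threads and flat-Vec suggestions above, here is a minimal sketch using the crossbeam crate. The function shape and the row-major indexing (i * n + j, so each worker's rows are contiguous) are my own choices, not the original code:
use crossbeam::thread;

fn update(old: &[i32], new: &mut [i32], n: usize, t_num: usize) {
    let rows_per = n / t_num;
    thread::scope(|s| {
        // each chunk is rows_per full rows; workers get disjoint
        // mutable slices, so no Arc or Mutex is needed
        for (t, chunk) in new.chunks_mut(rows_per * n).enumerate() {
            let start = t * rows_per;
            s.spawn(move |_| {
                for (offset, cell) in chunk.iter_mut().enumerate() {
                    let (i, j) = (start + offset / n, offset % n);
                    let (im, ip) = ((i + n - 1) % n, (i + 1) % n);
                    let (jm, jp) = ((j + n - 1) % n, (j + 1) % n);
                    let sum = old[im * n + jm] + old[im * n + j] + old[im * n + jp]
                        + old[i * n + jm] + old[i * n + jp]
                        + old[ip * n + jm] + old[ip * n + j] + old[ip * n + jp];
                    *cell = match sum {
                        2 => old[i * n + j],
                        3 => 1,
                        _ => 0,
                    };
                }
            });
        }
    })
    .unwrap();
}
Between generations, swap the two buffers with std::mem::swap instead of allocating a new board each time.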

How to let struct hold a thread and destroy thread as soon as it go out of scope

struct ThreadHolder{
state: ???
thread: ???
}
impl ThreadHolder {
fn launch(&mut self) {
self.thread = ???
// in thread change self.state
}
}
#[test]
fn test() {
let mut th = ThreadHolder{...};
th.launch();
// thread will be destroy as soon as th go out of scope
}
I think there is something to deal with lifetime, but I don't know how to write it.
What you want is simple enough that it doesn't even need to be mutable, and then it becomes trivial to share across threads, unless you want to reset it. You said you need to leave a thread running for one reason or another, so I'll assume that you don't care about this.
You can instead poll it every tick (most games run in ticks, so I don't think there will be any issue implementing that).
I will provide an example that uses sleep, so it's not the most accurate thing; that is painfully obvious on the last subsecond duration. But I am not trying to do your work for you, and there are enough resources on the internet to help you deal with that.
Here it goes:
use std::{
sync::Arc,
thread::{self, Result},
time::{Duration, Instant},
};
struct Timer {
end: Instant,
}
impl Timer {
fn new(duration: Duration) -> Self {
// this code is valid for now, but might break in the future
// a future so distant that you really don't need to care unless
// you let your players draw for eternity
let end = Instant::now().checked_add(duration).unwrap();
Timer { end }
}
fn left(&self) -> Duration {
self.end.saturating_duration_since(Instant::now())
}
// more usable than left() above: the fractional part is rounded up to a whole second
fn secs_left(&self) -> u64 {
let span = self.left();
span.as_secs() + if span.subsec_millis() > 0 { 1 } else { 0 }
}
}
fn main() -> Result<()> {
let timer = Timer::new(Duration::from_secs(10));
let timer_main = Arc::new(timer);
let timer = timer_main.clone();
let t = thread::spawn(move || loop {
let seconds_left = timer.secs_left();
println!("[Worker] Seconds left: {}", seconds_left);
if seconds_left == 0 {
break;
}
thread::sleep(Duration::from_secs(1));
});
loop {
let seconds_left = timer_main.secs_left();
println!("[Main] Seconds left: {}", seconds_left);
if seconds_left == 5 {
println!("[Main] 5 seconds left, waiting for worker thread to finish work.");
break;
}
thread::sleep(Duration::from_secs(1));
}
t.join()?;
println!("[Main] worker thread finished work, shutting down!");
Ok(())
}
By the way, this kind of implementation wouldn't be any different in any other language, so please don't blame Rust for it. It's not the easiest language, but it provides more than enough tools to build anything you want from scratch, as long as you put effort into it.
Good luck :)
I think I got it to work:
use std::sync::{Arc, Mutex};
use std::thread::{sleep, spawn, JoinHandle};
use std::time::Duration;
struct Timer {
pub(crate) time: Arc<Mutex<u32>>,
jh_ticker: Option<JoinHandle<()>>,
}
impl Timer {
fn new<T>(i: T, duration: Duration) -> Self
where
T: Iterator<Item = u32> + Send + 'static,
{
let time = Arc::new(Mutex::new(0));
let arc_time = time.clone();
let jh_ticker = Some(spawn(move || {
for item in i {
let mut mg = arc_time.lock().unwrap();
*mg = item;
drop(mg); // needed, otherwise this thread would hold the lock while it sleeps
sleep(duration);
}
}));
Timer { time, jh_ticker }
}
}
impl Drop for Timer {
fn drop(&mut self) {
// join() waits for the ticker thread to finish; Rust has no safe
// way to kill a thread, so drop blocks until the iterator is done
let _ = self.jh_ticker.take().unwrap().join();
}
}
#[test]
fn test_timer() {
let t = Timer::new(0..=10, Duration::from_secs(1));
let a = t.time.clone();
for _ in 0..100 {
let b = *a.lock().unwrap();
println!("{}", b);
sleep(Duration::from_millis(100));
}
}
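If you also want the thread to stop promptly when the holder is dropped, rather than waiting out the whole iterator, a common pattern (sketched here with hypothetical names) pairs the JoinHandle with an atomic stop flag that Drop sets before joining. Rust cannot kill a thread, so the thread has to check the flag itself:
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{sleep, spawn, JoinHandle};
use std::time::Duration;

struct ThreadHolder {
    stop: Arc<AtomicBool>,
    thread: Option<JoinHandle<()>>,
}

impl ThreadHolder {
    fn launch() -> Self {
        let stop = Arc::new(AtomicBool::new(false));
        let flag = stop.clone();
        let thread = Some(spawn(move || {
            while !flag.load(Ordering::Relaxed) {
                // per-tick work goes here
                sleep(Duration::from_millis(100));
            }
        }));
        ThreadHolder { stop, thread }
    }
}

impl Drop for ThreadHolder {
    fn drop(&mut self) {
        // ask the thread to stop, then wait for it
        self.stop.store(true, Ordering::Relaxed);
        let _ = self.thread.take().unwrap().join();
    }
}

fn main() {
    let _holder = ThreadHolder::launch();
    sleep(Duration::from_millis(350));
    // _holder is dropped here: the flag is set and the thread joins
}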

Is there an API to race N threads (or N closures on N threads) to completion?

Given several threads that complete with an Output value, how do I get the first Output that's produced? Ideally while still being able to get the remaining Outputs later in the order they're produced, and bearing in mind that some threads may or may not terminate.
Example:
struct Output(i32);
fn main() {
let mut spawned_threads = Vec::new();
for i in 0..10 {
let join_handle: ::std::thread::JoinHandle<Output> = ::std::thread::spawn(move || {
// pretend to do some work that takes some amount of time
::std::thread::sleep(::std::time::Duration::from_millis(
(1000 - (100 * i)) as u64,
));
Output(i) // then pretend to return the `Output` of that work
});
spawned_threads.push(join_handle);
}
// I can do this to wait for each thread to finish and collect all `Output`s
let outputs_in_order_of_thread_spawning = spawned_threads
.into_iter()
.map(::std::thread::JoinHandle::join)
.collect::<Vec<::std::thread::Result<Output>>>();
// but how would I get the `Output`s in order of completed threads?
}
I could solve the problem myself using a shared queue/channels/similar, but are there built-in APIs or existing libraries which could solve this use case for me more elegantly?
I'm looking for an API like:
fn race_threads<A: Send>(
threads: Vec<::std::thread::JoinHandle<A>>
) -> (::std::thread::Result<A>, Vec<::std::thread::JoinHandle<A>>) {
unimplemented!("so far this doesn't seem to exist")
}
(Rayon's join is the closest thing I could find, but (a) it only races 2 closures rather than an arbitrary number of closures, and (b) its thread pool with work stealing doesn't make sense for my use case, where some closures might run forever.)
It is possible to solve this use case using pointers from How to check if a thread has finished in Rust? just like it's possible to solve this use case using an MPSC channel, however here I'm after a clean API to race n threads (or failing that, n closures on n threads).
These problems can be solved by using a condition variable:
use std::sync::{Arc, Condvar, Mutex};
#[derive(Debug)]
struct Output(i32);
enum State {
Starting,
Joinable,
Joined,
}
fn main() {
let pair = Arc::new((Mutex::new(Vec::new()), Condvar::new()));
let mut spawned_threads = Vec::new();
let &(ref lock, ref cvar) = &*pair;
for i in 0..10 {
let my_pair = pair.clone();
let join_handle: ::std::thread::JoinHandle<Output> = ::std::thread::spawn(move || {
// pretend to do some work that takes some amount of time
::std::thread::sleep(::std::time::Duration::from_millis(
(1000 - (100 * i)) as u64,
));
let &(ref lock, ref cvar) = &*my_pair;
let mut joinable = lock.lock().unwrap();
joinable[i] = State::Joinable;
cvar.notify_one();
Output(i as i32) // then pretend to return the `Output` of that work
});
lock.lock().unwrap().push(State::Starting);
spawned_threads.push(Some(join_handle));
}
let mut should_stop = false;
while !should_stop {
let locked = lock.lock().unwrap();
let mut locked = cvar.wait(locked).unwrap();
should_stop = true;
for (i, state) in locked.iter_mut().enumerate() {
match *state {
State::Starting => {
should_stop = false;
}
State::Joinable => {
*state = State::Joined;
println!("{:?}", spawned_threads[i].take().unwrap().join());
}
State::Joined => (),
}
}
}
}
(playground link)
I'm not claiming this is the simplest way to do it. The condition variable wakes the main thread every time a child thread is done. The list tracks the state of each thread; if one is (about to be) finished, it can be joined.
No, there is no such API.
You've already been presented with multiple options to solve your problem:
Use channels
Use a Condvar
Use futures
Sometimes when programming, you have to go beyond sticking pre-made blocks together. This is supposed to be a fun part of programming. I encourage you to embrace it. Go create your ideal API using the components available and publish it to crates.io.
I really don't see what's so terrible about the channels version:
use std::{sync::mpsc, thread, time::Duration};
#[derive(Debug)]
struct Output(i32);
fn main() {
let (tx, rx) = mpsc::channel();
for i in 0..10 {
let tx = tx.clone();
thread::spawn(move || {
thread::sleep(Duration::from_millis((1000 - (100 * i)) as u64));
tx.send(Output(i)).unwrap();
});
}
// Don't hold on to the sender ourselves
// Otherwise the loop would never terminate
drop(tx);
for r in rx {
println!("{:?}", r);
}
}

Is it normal to experience large overhead using the 1:1 threading that comes in the standard library?

While I was learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers, both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance that I'm seeing:
rust_primes --threads 8 --verbose --count 1000000
Options { verbose: true, count: 1000000, threads: 8 }
Non-concurrent using while (15485863): 2.814 seconds.
Concurrent using mutexes (15485863): 876.561 seconds.
Concurrent using channels (15485863): 798.217 seconds.
Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:
fn non_concurrent(options: &Options) {
let mut count = 0;
let mut current = 0;
let ts = Instant::now();
while count < options.count {
if is_prime(current) {
count += 1;
}
current += 1;
}
let d = ts.elapsed();
println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_mutex(options: &Options) {
let count = Arc::new(Mutex::new(0));
let highest = Arc::new(Mutex::new(0));
let mut cc = 0;
let mut current = 0;
let ts = Instant::now();
while cc < options.count {
let mut handles = vec![];
for x in current..(current + options.threads) {
let count = Arc::clone(&count);
let highest = Arc::clone(&highest);
let handle = thread::spawn(move || {
if is_prime(x) {
let mut c = count.lock().unwrap();
let mut h = highest.lock().unwrap();
*c += 1;
if x > *h {
*h = x;
}
}
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
cc = *count.lock().unwrap();
current += options.threads;
}
let d = ts.elapsed();
println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_channel(options: &Options) {
let mut count = 0;
let mut current = 0;
let mut highest = 0;
let ts = Instant::now();
while count < options.count {
let (tx, rx) = mpsc::channel();
for x in current..(current + options.threads) {
let txc = mpsc::Sender::clone(&tx);
thread::spawn(move || {
if is_prime(x) {
txc.send(x).unwrap();
}
});
}
drop(tx);
for message in rx {
count += 1;
if message > highest && count <= options.count {
highest = message;
}
}
current += options.threads;
}
let d = ts.elapsed();
println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?
Here is an MCVE that shows the same problem. I didn't limit the number of threads it starts at once here, as I did in the code above. The point is, threading seems to have very significant overhead unless I'm doing something horribly wrong.
use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;
fn main() {
let iterations = 100_000;
non_threaded(iterations);
threaded(iterations);
}
fn threaded(iterations: u32) {
let tx = Instant::now();
let counter = Arc::new(Mutex::new(0));
let mut handles = vec![];
for _ in 0..iterations {
let counter = Arc::clone(&counter);
let handle = thread::spawn(move || {
let mut num = counter.lock().unwrap();
*num = test(*num);
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
let d = tx.elapsed();
println!("Threaded in {}.", dur_to_string(d));
}
fn non_threaded(iterations: u32) {
let tx = Instant::now();
let mut _q = 0;
for x in 0..iterations {
_q = test(x + 1);
}
let d = tx.elapsed();
println!("Non-threaded in {}.", dur_to_string(d));
}
fn dur_to_string(d: Duration) -> String {
let mut s = d.as_secs().to_string();
s.push_str(".");
s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
s
}
fn test(x: u32) -> u32 {
x
}
Here are the results of this on my machine:
Non-threaded in 0.9.
Threaded in 5.785.
threading seems to have a very significant overhead
It's not the general concept of "threading", it's the concept of creating and destroying lots of threads.
By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen as some threads will exit, memory will be reused, etc. I only saw it use ~400MiB.
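That reservation is configurable per thread through std's thread::Builder; a small sketch, with the 32 KiB figure chosen arbitrarily:
use std::thread;

fn main() {
    let handle = thread::Builder::new()
        .stack_size(32 * 1024) // instead of the 2MiB default
        .spawn(|| println!("hello from a small-stack thread"))
        .unwrap();
    handle.join().unwrap();
}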
On top of that, there is overhead involved with inter-thread communication (Mutex, channels, Atomic*) compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend to not have a lot of communication required. There are also different amounts of time required for different communication primitives. Atomic variables tend to be faster than others in many cases, but aren't as widely usable.
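As a small illustration of the cheaper primitives, a shared counter can be an atomic instead of a Mutex; a minimal sketch:
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let counter = Arc::clone(&counter);
            // fetch_add is a single atomic read-modify-write;
            // there is no lock for the threads to contend on
            thread::spawn(move || {
                counter.fetch_add(1, Ordering::SeqCst);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    assert_eq!(counter.load(Ordering::SeqCst), 8);
}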
Then there are compiler optimizations to account for. Non-threaded code is way easier to optimize than threaded code. For example, running your code in release mode shows:
Non-threaded in 0.0.
Threaded in 142.775.
That's right, the non-threaded code took no time at all. The compiler can see through the code, realize that nothing actually happens, and remove it all. I don't know how you got 5 seconds for the threaded code, as opposed to the 2+ minutes I saw.
Switching to a threadpool will reduce a lot of the unneeded creation of threads. We can also use a threadpool that provides scoped threads, which allows us to avoid the Arc as well:
extern crate scoped_threadpool;
use scoped_threadpool::Pool;
fn threaded(iterations: u32) {
let tx = Instant::now();
let counter = Mutex::new(0);
let mut pool = Pool::new(8);
pool.scoped(|scope| {
for _ in 0..iterations {
scope.execute(|| {
let mut num = counter.lock().unwrap();
*num = test(*num);
});
}
});
let d = tx.elapsed();
println!("Threaded in {}.", dur_to_string(d));
}
Non-threaded in 0.0.
Threaded in 0.675.
As with most pieces of programming, it's crucial to understand the tools you have and to use them appropriately.
