Conway's Game of Life becomes slower with multiple threads

I am a complete beginner in Rust, and I am currently writing a parallel version of Conway's Game of Life. The code itself works fine; the problem is that the program becomes slower when using multiple threads (I measure the speed by timing how long the glider takes to move from the top-left corner to the bottom-right corner). In my experiments it became slower and slower as the number of threads increased. I also have a Java version using almost the same algorithm, and it works just fine. All I expect is that the Rust version becomes at least slightly faster with more than one thread. Can anyone please point out where I went wrong? I am sorry if the code seems unreasonable; as I said, I am a complete beginner :-).
main.rs reads the command-line arguments and runs the board-update loop.
extern crate clap;
extern crate termion;
extern crate chrono;

use std::thread;
use std::sync::Arc;
use chrono::prelude::*;

mod board;
mod config;

use board::Board;
use config::Config;

fn main() {
    let dt1 = Local::now();
    let matches = clap::App::new("conway")
        .arg(clap::Arg::with_name("length")
            .long("length")
            .value_name("LENGTH")
            .help("Set length of the board")
            .takes_value(true))
        .arg(clap::Arg::with_name("threads")
            .long("threads")
            .value_name("THREADS")
            .help("How many threads to update the board")
            .takes_value(true))
        .arg(clap::Arg::with_name("display")
            .long("display")
            .value_name("DISPLAY")
            .help("Display the board or not")
            .takes_value(true))
        .arg(clap::Arg::with_name("delay")
            .long("delay")
            .value_name("MILLISECONDS")
            .help("Delay between the frames in milliseconds")
            .takes_value(true))
        .get_matches();
    let config = Config::from_matches(matches);
    let mut board = Board::new(config.length);
    let mut start: bool = false;
    let mut end: bool = false;
    let mut start_time: DateTime<Local> = Local::now();
    let mut end_time: DateTime<Local>;
    board.initialize_glider();
    loop {
        if config.display == 1 {
            print!("{}{}", termion::clear::All, termion::cursor::Goto(3, 3));
            board_render(&board);
        }
        if board.board[0][1] == 1 && !start {
            start_time = Local::now();
            start = true;
        }
        if board.board[config.length - 1][config.length - 1] == 1 && !end {
            end_time = Local::now();
            println!("{}", end_time - start_time);
            end = true;
        }
        board = board::Board::update(Arc::new(board), config.threads);
        thread::sleep(config.delay);
    }
}

fn board_render(board: &Board) {
    let mut output = String::with_capacity(board.n * (board.n + 1));
    for i in 0..board.n {
        for j in 0..board.n {
            let ch;
            if board.board[i][j] == 0 {
                ch = '░';
            } else {
                ch = '█';
            }
            output.push(ch);
        }
        output.push_str("\n ");
    }
    print!("{}", output);
}
board.rs contains the algorithm that updates the board with multiple threads.
use std::sync::{Mutex, Arc};
use std::thread;

pub struct Board {
    pub n: usize,
    pub board: Vec<Vec<i32>>,
}

impl Board {
    pub fn new(n: usize) -> Board {
        let board = vec![vec![0; n]; n];
        Board {
            n,
            board,
        }
    }

    pub fn update(Board: Arc<Self>, t_num: usize) -> Board {
        let new_board = Arc::new(Mutex::new(Board::new(Board.n)));
        let mut workers = Vec::with_capacity(t_num);
        let block_size = Board.n / t_num;
        let mut start = 0;
        for t in 0..t_num {
            let old_board = Board.clone();
            let new_board = Arc::clone(&new_board);
            let mut end = start + block_size;
            if t == t_num - 1 { end = old_board.n; }
            let worker = thread::spawn(move || {
                let mut board = new_board.lock().unwrap();
                for i in start..end {
                    for j in 0..old_board.n {
                        let im = (i + old_board.n - 1) % old_board.n;
                        let ip = (i + 1) % old_board.n;
                        let jm = (j + old_board.n - 1) % old_board.n;
                        let jp = (j + 1) % old_board.n;
                        let sum = old_board.board[im][jm] + old_board.board[im][j]
                            + old_board.board[im][jp] + old_board.board[i][jm] + old_board.board[i][jp]
                            + old_board.board[ip][jm] + old_board.board[ip][j] + old_board.board[ip][jp];
                        if sum == 2 {
                            board.board[i][j] = old_board.board[i][j];
                        } else if sum == 3 {
                            board.board[i][j] = 1;
                        } else {
                            board.board[i][j] = 0;
                        }
                    }
                }
            });
            workers.push(worker);
            start = start + block_size;
        }
        for worker in workers {
            worker.join().unwrap();
        }
        let result = new_board.lock().unwrap();
        let mut board = Board::new(Board.n);
        board.board = result.board.to_vec();
        board
    }

    pub fn initialize_glider(&mut self) -> &mut Board {
        self.board[0][1] = 1;
        self.board[1][2] = 1;
        self.board[2][0] = 1;
        self.board[2][1] = 1;
        self.board[2][2] = 1;
        self
    }
}

Each worker thread tries to lock the mutex immediately upon starting, and never releases the lock until it's done. Since only one thread can hold the mutex at a time, only one thread can do work at a time.
Here are two ways you might solve this problem:
Don't lock the mutex until you really, really need to. Create a scratch area inside the worker thread that represents the block you are updating. Fill the scratch area first. Then lock the mutex, copy the contents of the scratch area into the new_board, and return.
Using this method, most of the work can be done concurrently, but if all your workers finish at roughly the same time they will still have to take turns putting it all in new_board.
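For illustration, here is a minimal sketch of that first approach, reusing the field names and the neighbor-sum rule from the code above (with the parameter renamed to lower-case board, per the advice further down); only the locking strategy changes:
pub fn update(board: Arc<Board>, t_num: usize) -> Board {
    let new_board = Arc::new(Mutex::new(Board::new(board.n)));
    let mut workers = Vec::with_capacity(t_num);
    let block_size = board.n / t_num;
    let mut start = 0;
    for t in 0..t_num {
        let old_board = Arc::clone(&board);
        let new_board = Arc::clone(&new_board);
        let end = if t == t_num - 1 { old_board.n } else { start + block_size };
        workers.push(thread::spawn(move || {
            // Fill a private scratch block first; no lock is held here,
            // so all workers compute their blocks concurrently.
            let mut scratch = vec![vec![0; old_board.n]; end - start];
            for i in start..end {
                for j in 0..old_board.n {
                    let im = (i + old_board.n - 1) % old_board.n;
                    let ip = (i + 1) % old_board.n;
                    let jm = (j + old_board.n - 1) % old_board.n;
                    let jp = (j + 1) % old_board.n;
                    let sum = old_board.board[im][jm] + old_board.board[im][j]
                        + old_board.board[im][jp] + old_board.board[i][jm]
                        + old_board.board[i][jp] + old_board.board[ip][jm]
                        + old_board.board[ip][j] + old_board.board[ip][jp];
                    scratch[i - start][j] = match sum {
                        2 => old_board.board[i][j],
                        3 => 1,
                        _ => 0,
                    };
                }
            }
            // Only now take the lock, for the short copy at the end.
            let mut locked = new_board.lock().unwrap();
            for (offset, row) in scratch.into_iter().enumerate() {
                locked.board[start + offset] = row;
            }
        }));
        start += block_size;
    }
    for worker in workers {
        worker.join().unwrap();
    }
    // All worker clones are dropped after join, so unwrap the Arc and Mutex.
    Arc::try_unwrap(new_board).ok().unwrap().into_inner().unwrap()
}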
Don't use a lock at all: change the type of self.board to Vec<Vec<AtomicI32>> (std::sync::atomic::AtomicI32) and atomically update the board without having to acquire a lock.
This method may or may not slow down the process of updating, possibly depending on what memory orderings you use¹, but it eliminates contention for the lock.
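A sketch of what the atomic variant could look like; to keep it short it operates on plain Vec<Vec<i32>> grids rather than the Board type, and the function name update_atomic is made up for the example:
use std::sync::atomic::{AtomicI32, Ordering};
use std::sync::Arc;
use std::thread;

fn update_atomic(old: Arc<Vec<Vec<i32>>>, t_num: usize) -> Vec<Vec<i32>> {
    let n = old.len();
    // The shared "new board" is a grid of atomics, so no mutex is needed.
    let new: Arc<Vec<Vec<AtomicI32>>> =
        Arc::new((0..n).map(|_| (0..n).map(|_| AtomicI32::new(0)).collect()).collect());
    let block = n / t_num;
    let mut workers = Vec::with_capacity(t_num);
    for t in 0..t_num {
        let (old, new) = (Arc::clone(&old), Arc::clone(&new));
        let (start, end) = (t * block, if t == t_num - 1 { n } else { (t + 1) * block });
        workers.push(thread::spawn(move || {
            for i in start..end {
                for j in 0..n {
                    let (im, ip) = ((i + n - 1) % n, (i + 1) % n);
                    let (jm, jp) = ((j + n - 1) % n, (j + 1) % n);
                    let sum = old[im][jm] + old[im][j] + old[im][jp]
                        + old[i][jm] + old[i][jp]
                        + old[ip][jm] + old[ip][j] + old[ip][jp];
                    let cell = match sum { 2 => old[i][j], 3 => 1, _ => 0 };
                    // SeqCst per the footnote below; Relaxed would likely do.
                    new[i][j].store(cell, Ordering::SeqCst);
                }
            }
        }));
    }
    for worker in workers {
        worker.join().unwrap();
    }
    // Read the atomics back into a plain grid.
    new.iter()
        .map(|row| row.iter().map(|c| c.load(Ordering::SeqCst)).collect())
        .collect()
}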
Free-range advice
Don't call a variable Board. Convention, which the compiler warns you about, is to give variables snake_case names, but beyond that it is confusing because you also have a type named Board. I suggest just calling it self, which also lets you call update with method syntax.
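A toy illustration of the receiver change (the struct and body are placeholders):
use std::sync::Arc;

struct Board {
    n: usize,
}

impl Board {
    // `self: Arc<Self>` is an ordinary receiver type, so callers can write
    // `board.update(4)` instead of `Board::update(board, 4)`.
    fn update(self: Arc<Self>, t_num: usize) -> Board {
        Board { n: self.n + t_num } // placeholder body
    }
}

fn main() {
    let board = Arc::new(Board { n: 8 });
    let next = board.update(4); // method syntax works
    println!("{}", next.n);
}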
Don't put the whole board in an Arc so you can pass it to update, and then make a new board which has to be put in a new Arc the next iteration. Either make update return an Arc itself, or have it take self and do all the Arc-wrangling inside it.
Better still, don't use Arc at all. Use a crate that provides scoped threads to pass your data to the worker threads by reference.
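For example, here is a sketch with scoped threads (std::thread::scope has been in the standard library since Rust 1.63; this advice predates it, and crossbeam's scope works the same way). The function name update_scoped is made up for the example. Each worker borrows the old grid and receives an exclusive chunk of rows of the new one, so neither Arc nor Mutex is needed:
use std::thread;

fn update_scoped(old: &[Vec<i32>], new: &mut [Vec<i32>], t_num: usize) {
    let n = old.len();
    let block = (n / t_num).max(1);
    thread::scope(|s| {
        // chunks_mut hands each worker an exclusive slice of rows.
        for (t, rows) in new.chunks_mut(block).enumerate() {
            s.spawn(move || {
                for (offset, row) in rows.iter_mut().enumerate() {
                    let i = t * block + offset;
                    for j in 0..n {
                        let (im, ip) = ((i + n - 1) % n, (i + 1) % n);
                        let (jm, jp) = ((j + n - 1) % n, (j + 1) % n);
                        let sum = old[im][jm] + old[im][j] + old[im][jp]
                            + old[i][jm] + old[i][jp]
                            + old[ip][jm] + old[ip][j] + old[ip][jp];
                        row[j] = match sum { 2 => old[i][j], 3 => 1, _ => 0 };
                    }
                }
            });
        }
    }); // all workers are joined here
}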
Allocator performance will generally be better with a few large allocations than with many small ones. Change the type of Board.board to Vec<i32> and use arithmetic to calculate the indexes (for instance, point i, j is at index j*n + i).
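A minimal sketch of the flat layout using that indexing convention (the type name FlatBoard is made up for the example):
struct FlatBoard {
    n: usize,
    cells: Vec<i32>, // one allocation of n * n cells
}

impl FlatBoard {
    fn new(n: usize) -> Self {
        FlatBoard { n, cells: vec![0; n * n] }
    }
    fn get(&self, i: usize, j: usize) -> i32 {
        self.cells[j * self.n + i]
    }
    fn set(&mut self, i: usize, j: usize, v: i32) {
        self.cells[j * self.n + i] = v;
    }
}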
It's also better not to create and throw away allocations if you don't need to. Typical advice for cellular automata is to create two buffers that contain board states: the current state and the next state. When you're done creating the next state, just swap the buffers so the current state becomes the next state and vice versa.
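A sketch of the two-buffer pattern, with the update rule elided as a placeholder:
use std::mem;

fn step(current: &mut Vec<i32>, next: &mut Vec<i32>, n: usize) {
    for idx in 0..n * n {
        // ... write the new state of cell `idx` into next[idx] ...
        next[idx] = current[idx]; // placeholder rule
    }
    mem::swap(current, next); // O(1): just swaps the two buffers
}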
i32 wastes space; you could use i8, an enum, or possibly bool.
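For instance, a one-byte cell type (the name Cell is made up for the example):
#[repr(u8)]
#[derive(Clone, Copy, PartialEq, Eq)]
enum Cell {
    Dead = 0,
    Alive = 1,
}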
¹ I would suggest SeqCst unless you really know what you're doing. I suspect Relaxed is probably sufficient, but I don't really know what I'm doing.

Related

Why is sequential faster than parallel?

The code counts the frequency of each word in an article. I implemented a sequential version, a multi-threaded version, and a multi-threaded version with a thread pool.
I tested the running time of the three methods, and found that the sequential method was the fastest. I used the article (data) at 37423.txt; the code is at play.rust-lang.org.
Below is just the single- and multi-threaded version (without the threadpool version):
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::SystemTime;

pub fn word_count(article: &str) -> HashMap<String, i64> {
    let now1 = SystemTime::now();
    let mut map = HashMap::new();
    for word in article.split_whitespace() {
        let count = map.entry(word.to_string()).or_insert(0);
        *count += 1;
    }
    let after1 = SystemTime::now();
    let d1 = after1.duration_since(now1);
    println!("single: {:?}", d1.as_ref().unwrap());
    map
}

fn word_count_thread(word_vec: Vec<String>, counts: &Arc<Mutex<HashMap<String, i64>>>) {
    let mut p_count = HashMap::new();
    for word in word_vec {
        *p_count.entry(word).or_insert(0) += 1;
    }
    let mut counts = counts.lock().unwrap();
    for (word, count) in p_count {
        *counts.entry(word.to_string()).or_insert(0) += count;
    }
}

pub fn mt_word_count(article: &str) -> HashMap<String, i64> {
    let word_vec = article
        .split_whitespace()
        .map(|x| x.to_owned())
        .collect::<Vec<String>>();
    let count = Arc::new(Mutex::new(HashMap::new()));
    let len = word_vec.len();
    let q1 = len / 4;
    let q2 = len / 2;
    let q3 = q1 * 3;
    let part1 = word_vec[..q1].to_vec();
    let part2 = word_vec[q1..q2].to_vec();
    let part3 = word_vec[q2..q3].to_vec();
    let part4 = word_vec[q3..].to_vec();
    let now2 = SystemTime::now();
    let count1 = count.clone();
    let count2 = count.clone();
    let count3 = count.clone();
    let count4 = count.clone();
    let handle1 = thread::spawn(move || {
        word_count_thread(part1, &count1);
    });
    let handle2 = thread::spawn(move || {
        word_count_thread(part2, &count2);
    });
    let handle3 = thread::spawn(move || {
        word_count_thread(part3, &count3);
    });
    let handle4 = thread::spawn(move || {
        word_count_thread(part4, &count4);
    });
    handle1.join().unwrap();
    handle2.join().unwrap();
    handle3.join().unwrap();
    handle4.join().unwrap();
    let x = count.lock().unwrap().clone();
    let after2 = SystemTime::now();
    let d2 = after2.duration_since(now2);
    println!("muti: {:?}", d2.as_ref().unwrap());
    x
}
My results are: single: 7.93ms, muti: 15.78ms, threadpool: 25.33ms.
I split the article into parts before starting the timer.
I want to know whether there is a problem with my code.
First, you may want to know that the single-threaded version spends most of its time parsing whitespace (and on I/O, but the file is small, so it will be in the OS cache on the second run). At most ~20% of the runtime is the counting that you parallelized; a cargo flamegraph of the single-threaded code confirms this.
In the multi-threaded version, the profile is a mess of thread creation, copying, and hashmap overhead. To make sure it's not a "too little data" problem, I used 100x your input txt file, and I'm measuring a 2x slowdown over the single-threaded version. According to the time command, it uses 2x CPU time compared to wall-clock time, so it does seem to do some parallel work.
I'm not sure what to make of it exactly, but it's clear that memory management overhead has increased a lot. You seem to be copying strings for each hashmap, while an ideal wordcount would probably do zero string copying while counting.
More generally, I think it's a bad idea to combine the results inside the worker threads: the way you do it (as opposed to a map-reduce pattern), each thread needs a global lock. You could instead pass the results back to the main thread for combining, as sketched below. I'm not sure if this is the main problem here, though.
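A sketch of that shape, keeping the String-copying version for simplicity: each worker returns its private map through the JoinHandle, and the merge in the main thread needs no lock at all.
use std::collections::HashMap;
use std::thread;

pub fn mt_word_count(article: &str) -> HashMap<String, i64> {
    let words: Vec<String> = article.split_whitespace().map(str::to_owned).collect();
    let chunk = (words.len() / 4).max(1);
    // Map step: each worker counts its own chunk into a private map.
    let handles: Vec<_> = words
        .chunks(chunk)
        .map(|part| {
            let part = part.to_vec();
            thread::spawn(move || {
                let mut counts = HashMap::new();
                for word in part {
                    *counts.entry(word).or_insert(0) += 1;
                }
                counts // returned to the main thread via join()
            })
        })
        .collect();
    // Reduce step: merge the per-thread maps sequentially, lock-free.
    let mut total: HashMap<String, i64> = HashMap::new();
    for handle in handles {
        for (word, count) in handle.join().unwrap() {
            *total.entry(word).or_insert(0) += count;
        }
    }
    total
}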
Optimization
To avoid string copying, use HashMap<&str, i64> instead of HashMap<String, i64>. This requires some changes (lifetime annotations and thread::scope()). It makes mt_word_count() about 6x faster compared to the old version.
With a large input, I'm now measuring a 4x speedup compared to word_count(), which is the best you can hope for with four threads. The multi-threaded version is now also faster overall, but only by ~20% or so, for the reasons explained above. (Note that the single-threaded baseline has also profited from the same &str optimization. Many things could still be improved/optimized, but I'll stop here.)
fn word_count_thread<'t>(word_vec: Vec<&'t str>, counts: &Arc<Mutex<HashMap<&'t str, i64>>>) {
    let mut p_count = HashMap::new();
    for word in word_vec {
        *p_count.entry(word).or_insert(0) += 1;
    }
    let mut counts = counts.lock().unwrap();
    for (word, count) in p_count {
        *counts.entry(word).or_insert(0) += count;
    }
}

pub fn mt_word_count<'t>(article: &'t str) -> HashMap<&'t str, i64> {
    let word_vec = article.split_whitespace().collect::<Vec<&str>>();
    // (skipping 16 unmodified lines)
    let x = thread::scope(|scope| {
        let handle1 = scope.spawn(move || {
            word_count_thread(part1, &count1);
        });
        let handle2 = scope.spawn(move || {
            word_count_thread(part2, &count2);
        });
        let handle3 = scope.spawn(move || {
            word_count_thread(part3, &count3);
        });
        let handle4 = scope.spawn(move || {
            word_count_thread(part4, &count4);
        });
        handle1.join().unwrap();
        handle2.join().unwrap();
        handle3.join().unwrap();
        handle4.join().unwrap();
        count.lock().unwrap().clone()
    });
    let after2 = SystemTime::now();
    let d2 = after2.duration_since(now2);
    println!("muti: {:?}", d2.as_ref().unwrap());
    x
}

Run function over range in multiple threads

I have a function,
fn calculate(x: i64) -> i64 {
    // do stuff
    x
}
which I want to apply to a range
for i in 0..100 {
    calculate(i);
}
I want to multithread this, though. I've tried different things: an atomic i would be a good idea, but then I'd have to get into the details of shared ownership, use external libraries, etc. Is there a simple way of doing this?
If you just want to run stuff on multiple threads and don't really care about the specifics, rayon might be helpful:
use rayon::prelude::*;

fn calculate(x: i64) -> i64 {
    x
}

fn main() {
    let results = (0..100i64)
        .into_par_iter()
        .map(calculate)
        .collect::<Vec<i64>>();

    println!("Results: {:?}", results);
}
This will automatically spin up threads based on how many cores you have and distribute work between them.
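If you do want explicit control over the thread count, rayon also lets you configure the global pool; a small sketch:
use rayon::prelude::*;

fn main() {
    // Optional: cap the pool at 4 threads instead of one per core.
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .unwrap();

    let total: i64 = (0..100i64).into_par_iter().sum();
    println!("total = {}", total);
}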
Not really certain about what you want to achieve precisely, but here is a trivial example.
let th: Vec<_> = (0..4) // four threads will be launched
    .map(|t| {
        std::thread::spawn(move || {
            let begin = t * 25;
            let end = (t + 1) * 25;
            for i in begin..end { // each one handles a quarter of the overall computation
                let r = calculate(i);
                println!("t={} i={} r={}", t, i, r);
            }
        })
    })
    .collect();
for t in th { // wait for the four threads to terminate
    let _ = t.join();
}

Why do Rust mutexes not seem to give the lock to the thread that wanted to lock it last?

I wanted to write a program that spawns two threads that lock a Mutex, increase it, print something, and then unlock the Mutex so the other thread can do the same. I added some sleep time to make it more consistent, so I thought the output should be something like:
ping pong ping pong …
but the actual output is pretty random. Most of the time, it is just
ping ping ping … pong
But there's no consistency at all; sometimes there is a “pong” in the middle too.
I was of the belief that mutexes had some kind of way to determine who wanted to lock it last but it doesn’t look like that’s the case.
How does the locking actually work?
How can I get the desired output?
use std::sync::{Arc, Mutex};
use std::{thread, time};

fn main() {
    let data1 = Arc::new(Mutex::new(1));
    let data2 = data1.clone();

    let ten_millis = time::Duration::from_millis(10);

    let a = thread::spawn(move || loop {
        let mut data = data1.lock().unwrap();
        thread::sleep(ten_millis);
        println!("ping ");
        *data += 1;
        if *data > 10 {
            break;
        }
    });

    let b = thread::spawn(move || loop {
        let mut data = data2.lock().unwrap();
        thread::sleep(ten_millis);
        println!("pong ");
        *data += 1;
        if *data > 10 {
            break;
        }
    });

    a.join().unwrap();
    b.join().unwrap();
}
Mutex and RwLock both defer to OS-specific primitives and cannot be guaranteed to be fair. On Windows, they are both implemented with SRW locks which are specifically documented as not fair. I didn't do research for other operating systems but you definitely cannot rely on fairness with std::sync::Mutex, especially if you need this code to be portable.
A possible solution in Rust is the Mutex implementation provided by the parking_lot crate, which provides an unlock_fair method, which is implemented with a fair algorithm.
From the parking_lot documentation:
By default, mutexes are unfair and allow the current thread to re-lock the mutex before another has the chance to acquire the lock, even if that thread has been blocked on the mutex for a long time. This is the default because it allows much higher throughput as it avoids forcing a context switch on every mutex unlock. This can result in one thread acquiring a mutex many more times than other threads.
However in some cases it can be beneficial to ensure fairness by forcing the lock to pass on to a waiting thread if there is one. This is done by using this method instead of dropping the MutexGuard normally.
While parking_lot::Mutex doesn't claim to be fair without specifically using the unlock_fair method, I found that your code produced the same number of pings as pongs, by just making that switch (playground), not even using the unlock_fair method.
Usually, mutexes are unlocked automatically when a guard goes out of scope. To make the unlock fair, you need to insert this method call before the guard is dropped:
// With parking_lot::Mutex, lock() returns the guard directly (no unwrap).
let b = thread::spawn(move || loop {
    let mut data = data2.lock();
    thread::sleep(ten_millis);
    println!("pong ");
    *data += 1;
    if *data > 10 {
        break;
    }
    MutexGuard::unlock_fair(data);
});
The order of locking the mutex is not guaranteed in any way; it's possible for the first thread to acquire the lock 100% of the time and the second thread never.
The threads are scheduled by the OS and the following scenario is quite possible:
the OS gives CPU time to the first thread and it acquires the lock
the OS gives CPU time to the second thread, but the lock is taken, hence it goes to sleep
The first thread releases the lock but is still allowed to run by the OS. It goes for another iteration of the loop and re-acquires the lock.
The other thread cannot proceed, because the lock is still taken.
If you give the second thread more time to acquire the lock you will see the expected ping-pong pattern, although there is no guarantee (a bad OS may decide to never give CPU time to some of your threads):
use std::sync::{Arc, Mutex};
use std::{thread, time};

fn main() {
    let data1 = Arc::new(Mutex::new(1));
    let data2 = data1.clone();

    let ten_millis = time::Duration::from_millis(10);

    let a = thread::spawn(move || loop {
        let mut data = data1.lock().unwrap();
        *data += 1;
        if *data > 10 {
            break;
        }
        drop(data);
        thread::sleep(ten_millis);
        println!("ping ");
    });

    let b = thread::spawn(move || loop {
        let mut data = data2.lock().unwrap();
        *data += 1;
        if *data > 10 {
            break;
        }
        drop(data);
        thread::sleep(ten_millis);
        println!("pong ");
    });

    a.join().unwrap();
    b.join().unwrap();
}
You can verify that by playing with the sleep time. The lower the sleep time, the more irregular the ping-pong alternations will be, and with values as low as 10ms, you may see ping-ping-pong, etc.
Essentially, a solution based on time is bad by design. You can guarantee that "ping" will be followed by "pong" by improving the algorithm. For instance you can print "ping" on odd numbers and "pong" on even numbers:
use std::sync::{Arc, Mutex};
use std::{thread, time};

const MAX_ITER: i32 = 10;

fn main() {
    let data1 = Arc::new(Mutex::new(1));
    let data2 = data1.clone();

    let ten_millis = time::Duration::from_millis(10);

    let a = thread::spawn(move || 'outer: loop {
        loop {
            thread::sleep(ten_millis);
            let mut data = data1.lock().unwrap();
            if *data > MAX_ITER {
                break 'outer;
            }
            if *data & 1 == 1 {
                *data += 1;
                println!("ping ");
                break;
            }
        }
    });

    let b = thread::spawn(move || 'outer: loop {
        loop {
            thread::sleep(ten_millis);
            let mut data = data2.lock().unwrap();
            if *data > MAX_ITER {
                break 'outer;
            }
            if *data & 1 == 0 {
                *data += 1;
                println!("pong ");
                break;
            }
        }
    });

    a.join().unwrap();
    b.join().unwrap();
}
This isn't the best implementation, but I tried to do it with as few modifications as possible to the original code.
You may also consider an implementation with a Condvar, a better solution in my opinion, as it avoids busy-waiting on the mutex (P.S. it also removes the code duplication):
use std::sync::{Arc, Mutex, Condvar};
use std::thread;

const MAX_ITER: i32 = 10;

fn main() {
    let cv1 = Arc::new((Condvar::new(), Mutex::new(1)));
    let cv2 = cv1.clone();

    let a = thread::spawn(ping_pong_task("ping", cv1, |x| x & 1 == 1));
    let b = thread::spawn(ping_pong_task("pong", cv2, |x| x & 1 == 0));

    a.join().unwrap();
    b.join().unwrap();
}

fn ping_pong_task<S: Into<String>>(
    msg: S,
    cv: Arc<(Condvar, Mutex<i32>)>,
    check: impl Fn(i32) -> bool) -> impl Fn()
{
    let message = msg.into();
    move || {
        let (condvar, mutex) = &*cv;
        let mut value = mutex.lock().unwrap();
        loop {
            if check(*value) {
                println!("{} ", message);
                *value += 1;
                condvar.notify_all();
            }
            if *value > MAX_ITER {
                break;
            }
            value = condvar.wait(value).unwrap();
        }
    }
}
I was of the belief that mutexes had some kind of way to determine who wanted to lock it last but it doesn’t look like that’s the case.
Nope. The job of a mutex is just to make the code run as fast as possible. Alternation gives the worst performance because you're constantly blowing out the CPU caches. You are asking for the worst possible implementation of a mutex.
How does the locking actually work?
The scheduler tries to get as much work done as possible. It's your job to write code that only does the work you really want to get done.
How can I get the desired output?
Don't use two threads if you just want to do one thing then something else then the first thing again. Use threads when you don't care about the order in which work is done and just want to get as much work done as possible.

Is there an API to race N threads (or N closures on N threads) to completion?

Given several threads that complete with an Output value, how do I get the first Output that's produced? Ideally while still being able to get the remaining Outputs later in the order they're produced, and bearing in mind that some threads may or may not terminate.
Example:
struct Output(i32);

fn main() {
    let mut spawned_threads = Vec::new();

    for i in 0..10 {
        let join_handle: ::std::thread::JoinHandle<Output> = ::std::thread::spawn(move || {
            // pretend to do some work that takes some amount of time
            ::std::thread::sleep(::std::time::Duration::from_millis(
                (1000 - (100 * i)) as u64,
            ));
            Output(i) // then pretend to return the `Output` of that work
        });
        spawned_threads.push(join_handle);
    }

    // I can do this to wait for each thread to finish and collect all `Output`s
    let outputs_in_order_of_thread_spawning = spawned_threads
        .into_iter()
        .map(::std::thread::JoinHandle::join)
        .collect::<Vec<::std::thread::Result<Output>>>();

    // but how would I get the `Output`s in order of completed threads?
}
I could solve the problem myself using a shared queue/channels/similar, but are there built-in APIs or existing libraries which could solve this use case for me more elegantly?
I'm looking for an API like:
fn race_threads<A: Send>(
    threads: Vec<::std::thread::JoinHandle<A>>
) -> (::std::thread::Result<A>, Vec<::std::thread::JoinHandle<A>>) {
    unimplemented!("so far this doesn't seem to exist")
}
(Rayon's join is the closest I could find, but a) it only races 2 closures rather than an arbitrary number of closures, and b) the thread-pool-with-work-stealing approach doesn't make sense for my use case of having some closures that might run forever.)
It is possible to solve this use case using pointers from How to check if a thread has finished in Rust? just like it's possible to solve this use case using an MPSC channel, however here I'm after a clean API to race n threads (or failing that, n closures on n threads).
These problems can be solved by using a condition variable:
use std::sync::{Arc, Condvar, Mutex};

#[derive(Debug)]
struct Output(i32);

enum State {
    Starting,
    Joinable,
    Joined,
}

fn main() {
    let pair = Arc::new((Mutex::new(Vec::new()), Condvar::new()));
    let mut spawned_threads = Vec::new();

    let &(ref lock, ref cvar) = &*pair;
    for i in 0..10 {
        let my_pair = pair.clone();

        // Reserve this thread's slot *before* spawning, so the worker can
        // never index into the vector before the slot exists.
        lock.lock().unwrap().push(State::Starting);

        let join_handle: ::std::thread::JoinHandle<Output> = ::std::thread::spawn(move || {
            // pretend to do some work that takes some amount of time
            ::std::thread::sleep(::std::time::Duration::from_millis(
                (1000 - (100 * i)) as u64,
            ));

            let &(ref lock, ref cvar) = &*my_pair;
            let mut joinable = lock.lock().unwrap();
            joinable[i] = State::Joinable;
            cvar.notify_one();

            Output(i as i32) // then pretend to return the `Output` of that work
        });

        spawned_threads.push(Some(join_handle));
    }

    let mut should_stop = false;
    while !should_stop {
        let locked = lock.lock().unwrap();
        let mut locked = cvar.wait(locked).unwrap();

        should_stop = true;
        for (i, state) in locked.iter_mut().enumerate() {
            match *state {
                State::Starting => {
                    should_stop = false;
                }
                State::Joinable => {
                    *state = State::Joined;
                    println!("{:?}", spawned_threads[i].take().unwrap().join());
                }
                State::Joined => (),
            }
        }
    }
}
I'm not claiming this is the simplest way to do it. The condition variable will wake the main thread every time a child thread is done. The list shows the state of each thread; if one is (about to) finish, it can be joined.
No, there is no such API.
You've already been presented with multiple options to solve your problem:
Use channels
Use a CondVar
Use futures
Sometimes when programming, you have to go beyond sticking pre-made blocks together. This is supposed to be a fun part of programming. I encourage you to embrace it. Go create your ideal API using the components available and publish it to crates.io.
I really don't see what's so terrible about the channels version:
use std::{sync::mpsc, thread, time::Duration};

#[derive(Debug)]
struct Output(i32);

fn main() {
    let (tx, rx) = mpsc::channel();

    for i in 0..10 {
        let tx = tx.clone();
        thread::spawn(move || {
            thread::sleep(Duration::from_millis((1000 - (100 * i)) as u64));
            tx.send(Output(i)).unwrap();
        });
    }

    // Don't hold on to the sender ourselves
    // Otherwise the loop would never terminate
    drop(tx);

    for r in rx {
        println!("{:?}", r);
    }
}

Is it normal to experience large overhead using the 1:1 threading that comes in the standard library?

While I was learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers, both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance that I'm seeing:
rust_primes --threads 8 --verbose --count 1000000
Options { verbose: true, count: 1000000, threads: 8 }
Non-concurrent using while (15485863): 2.814 seconds.
Concurrent using mutexes (15485863): 876.561 seconds.
Concurrent using channels (15485863): 798.217 seconds.
Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:
fn non_concurrent(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let ts = Instant::now();
    while count < options.count {
        if is_prime(current) {
            count += 1;
        }
        current += 1;
    }
    let d = ts.elapsed();
    println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}

fn concurrent_mutex(options: &Options) {
    let count = Arc::new(Mutex::new(0));
    let highest = Arc::new(Mutex::new(0));
    let mut cc = 0;
    let mut current = 0;
    let ts = Instant::now();
    while cc < options.count {
        let mut handles = vec![];
        for x in current..(current + options.threads) {
            let count = Arc::clone(&count);
            let highest = Arc::clone(&highest);
            let handle = thread::spawn(move || {
                if is_prime(x) {
                    let mut c = count.lock().unwrap();
                    let mut h = highest.lock().unwrap();
                    *c += 1;
                    if x > *h {
                        *h = x;
                    }
                }
            });
            handles.push(handle);
        }
        for handle in handles {
            handle.join().unwrap();
        }
        cc = *count.lock().unwrap();
        current += options.threads;
    }
    let d = ts.elapsed();
    println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}

fn concurrent_channel(options: &Options) {
    let mut count = 0;
    let mut current = 0;
    let mut highest = 0;
    let ts = Instant::now();
    while count < options.count {
        let (tx, rx) = mpsc::channel();
        for x in current..(current + options.threads) {
            let txc = mpsc::Sender::clone(&tx);
            thread::spawn(move || {
                if is_prime(x) {
                    txc.send(x).unwrap();
                }
            });
        }
        drop(tx);
        for message in rx {
            count += 1;
            if message > highest && count <= options.count {
                highest = message;
            }
        }
        current += options.threads;
    }
    let d = ts.elapsed();
    println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?
Here is an MCVE that shows the same problem. I didn't limit the number of threads started at once here, as I did in the code above. The point is that threading seems to have a very significant overhead, unless I'm doing something horribly wrong.
use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;

fn main() {
    let iterations = 100_000;
    non_threaded(iterations);
    threaded(iterations);
}

fn threaded(iterations: u32) {
    let tx = Instant::now();
    let counter = Arc::new(Mutex::new(0));
    let mut handles = vec![];
    for _ in 0..iterations {
        let counter = Arc::clone(&counter);
        let handle = thread::spawn(move || {
            let mut num = counter.lock().unwrap();
            *num = test(*num);
        });
        handles.push(handle);
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let d = tx.elapsed();
    println!("Threaded in {}.", dur_to_string(d));
}

fn non_threaded(iterations: u32) {
    let tx = Instant::now();
    let mut _q = 0;
    for x in 0..iterations {
        _q = test(x + 1);
    }
    let d = tx.elapsed();
    println!("Non-threaded in {}.", dur_to_string(d));
}

fn dur_to_string(d: Duration) -> String {
    let mut s = d.as_secs().to_string();
    s.push_str(".");
    s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
    s
}

fn test(x: u32) -> u32 {
    x
}
Here are the results of this on my machine:
Non-threaded in 0.9.
Threaded in 5.785.
threading seems to have a very significant overhead
It's not the general concept of "threading", it's the concept of creating and destroying lots of threads.
By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen as some threads will exit, memory will be reused, etc. I only saw it use ~400MiB.
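If you really do need many short-lived threads, the per-thread stack size can be reduced via the standard library's thread::Builder; a minimal sketch:
use std::thread;

fn main() {
    // Request a 32 KiB stack instead of the ~2 MiB default.
    let handle = thread::Builder::new()
        .stack_size(32 * 1024)
        .spawn(|| (0..100u32).sum::<u32>()) // small work, shallow stack
        .unwrap();
    println!("{}", handle.join().unwrap());
}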
On top of that, there is overhead involved with inter-thread communication (Mutex, channels, Atomic*) compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend to not have a lot of communication required. There are also different amounts of time required for different communication primitives. Atomic variables tend to be faster than others in many cases, but aren't as widely usable.
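As a toy illustration of that cost difference, both of the shared counters below are correct, but the atomic one avoids a lock round-trip on every increment:
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let locked = Arc::new(Mutex::new(0u64));
    let atomic = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let (locked, atomic) = (Arc::clone(&locked), Arc::clone(&atomic));
            thread::spawn(move || {
                for _ in 0..125_000 {
                    *locked.lock().unwrap() += 1; // lock, add, unlock
                    atomic.fetch_add(1, Ordering::Relaxed); // single atomic op
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // Both end up at 1_000_000; the atomic gets there with less overhead.
    println!("{} {}", *locked.lock().unwrap(), atomic.load(Ordering::Relaxed));
}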
Then there's compiler optimizations to account for. Non-threaded code is way easier to optimize compared to threaded code. For example, running your code in release mode shows:
Non-threaded in 0.0.
Threaded in 142.775.
That's right, the non-threaded code took no time. The compiler can see through the code and realizes that nothing actually happens and removes it all. I don't know how you got 5 seconds for the threaded code as opposed to the 2+ minutes I saw.
Switching to a threadpool avoids most of the unneeded thread creation. We can also use a threadpool that provides scoped threads, which lets us avoid the Arc as well:
extern crate scoped_threadpool;

use scoped_threadpool::Pool;

fn threaded(iterations: u32) {
    let tx = Instant::now();

    let counter = Mutex::new(0);

    let mut pool = Pool::new(8);
    pool.scoped(|scope| {
        for _ in 0..iterations {
            scope.execute(|| {
                let mut num = counter.lock().unwrap();
                *num = test(*num);
            });
        }
    });

    let d = tx.elapsed();
    println!("Threaded in {}.", dur_to_string(d));
}
Non-threaded in 0.0.
Threaded in 0.675.
As with most pieces of programming, it's crucial to understand the tools you have and to use them appropriately.
