How do I run parallel threads of computation on a partitioned array? - multithreading

I'm trying to distribute an array across threads and have the threads sum up portions of the array in parallel. I want thread 0 to sum elements 0 1 2 and Thread 1 sum elements 3 4 5. Thread 2 to sum 6 and 7. and Thread 3 to sum 8 and 9.
I'm new to Rust but have coded with C/C++/Java before. I've literally thrown everything and the garbage sink at this program and I was hoping I could receive some guidance.
Sorry my code is sloppy but I will clean it up when it is a finished product. Please ignore all poorly named variables/inconsistent spacing/etc.
use std::io;
use std::rand;
use std::sync::mpsc::{Sender, Receiver};
use std::sync::mpsc;
use std::thread::Thread;
static NTHREADS: usize = 4;
static NPROCS: usize = 10;
fn main() {
let mut a = [0; 10]; // a: [i32; 10]
let mut endpoint = a.len() / NTHREADS;
let mut remElements = a.len() % NTHREADS;
for x in 0..a.len() {
let secret_number = (rand::random::<i32>() % 100) + 1;
a[x] = secret_number;
println!("{}", a[x]);
}
let mut b = a;
let mut x = 0;
check_sum(&mut a);
// serial_sum(&mut b);
// Channels have two endpoints: the `Sender<T>` and the `Receiver<T>`,
// where `T` is the type of the message to be transferred
// (type annotation is superfluous)
let (tx, rx): (Sender<i32>, Receiver<i32>) = mpsc::channel();
let mut scale: usize = 0;
for id in 0..NTHREADS {
// The sender endpoint can be copied
let thread_tx = tx.clone();
// Each thread will send its id via the channel
Thread::spawn(move || {
// The thread takes ownership over `thread_tx`
// Each thread queues a message in the channel
let numTougherThreads: usize = NPROCS % NTHREADS;
let numTasksPerThread: usize = NPROCS / NTHREADS;
let mut lsum = 0;
if id < numTougherThreads {
let mut q = numTasksPerThread+1;
lsum = 0;
while q > 0 {
lsum = lsum + a[scale];
scale+=1;
q = q-1;
}
println!("Less than numToughThreads lsum: {}", lsum);
}
if id >= numTougherThreads {
let mut z = numTasksPerThread;
lsum = 0;
while z > 0 {
lsum = lsum + a[scale];
scale +=1;
z = z-1;
}
println!("Greater than numToughthreads lsum: {}", lsum);
}
// Sending is a non-blocking operation, the thread will continue
// immediately after sending its message
println!("thread {} finished", id);
thread_tx.send(lsum).unwrap();
});
}
// Here, all the messages are collected
let mut globalSum = 0;
let mut ids = Vec::with_capacity(NTHREADS);
for _ in 0..NTHREADS {
// The `recv` method picks a message from the channel
// `recv` will block the current thread if there no messages available
ids.push(rx.recv());
}
println!("Global Sum: {}", globalSum);
// Show the order in which the messages were sent
println!("ids: {:?}", ids);
}
fn check_sum (arr: &mut [i32]) {
let mut sum = 0;
let mut i = 0;
let mut size = arr.len();
loop {
sum += arr[i];
i+=1;
if i == size { break; }
}
println!("CheckSum is {}", sum);
}
So far I've gotten it to do this much. Can't figure out why threads 0 and 1 have the same sum as well as 2 and 3 doing the same thing:
-5
-49
-32
99
45
-65
-64
-29
-56
65
CheckSum is -91
Greater than numTough lsum: -54
thread 2 finished
Less than numTough lsum: -86
thread 1 finished
Less than numTough lsum: -86
thread 0 finished
Greater than numTough lsum: -54
thread 3 finished
Global Sum: 0
ids: [Ok(-86), Ok(-86), Ok(-54), Ok(-54)]
I managed to rewrite it to work with even numbers by using the below code.
while q > 0 {
if id*s+scale == a.len() { break; }
lsum = lsum + a[id*s+scale];
scale +=1;
q = q-1;
}
println!("Less than numToughThreads lsum: {}", lsum);
}
if id >= numTougherThreads {
let mut z = numTasksPerThread;
lsum = 0;
let mut scale = 0;
while z > 0 {
if id*numTasksPerThread+scale == a.len() { break; }
lsum = lsum + a[id*numTasksPerThread+scale];
scale = scale + 1;
z = z-1;
}

Welcome to Rust! :)
Yeah at first I didn't realize each thread gets it's own copy of scale
Not only that! It also gets its own copy of a!
What you are trying to do could look like the following code. I guess it's easier for you to see a complete working example since you seem to be a Rust beginner and asked for guidance. I deliberately replaced [i32; 10] with a Vec since a Vec is not implicitly Copyable. It requires an explicit clone(); we cannot copy it by accident. Please note all the larger and smaller differences. The code also got a little more functional (less mut). I commented most of the noteworthy things:
extern crate rand;
use std::sync::Arc;
use std::sync::mpsc;
use std::thread;
const NTHREADS: usize = 4; // I replaced `static` by `const`
// gets used for *all* the summing :)
fn sum<I: Iterator<Item=i32>>(iter: I) -> i32 {
let mut s = 0;
for x in iter {
s += x;
}
s
}
fn main() {
// We don't want to clone the whole vector into every closure.
// So we wrap it in an `Arc`. This allows sharing it.
// I also got rid of `mut` here by moving the computations into
// the initialization.
let a: Arc<Vec<_>> =
Arc::new(
(0..10)
.map(|_| {
(rand::random::<i32>() % 100) + 1
})
.collect()
);
let (tx, rx) = mpsc::channel(); // types will be inferred
{ // local scope, we don't need the following variables outside
let num_tasks_per_thread = a.len() / NTHREADS; // same here
let num_tougher_threads = a.len() % NTHREADS; // same here
let mut offset = 0;
for id in 0..NTHREADS {
let chunksize =
if id < num_tougher_threads {
num_tasks_per_thread + 1
} else {
num_tasks_per_thread
};
let my_a = a.clone(); // refers to the *same* `Vec`
let my_tx = tx.clone();
thread::spawn(move || {
let end = offset + chunksize;
let partial_sum =
sum( (&my_a[offset..end]).iter().cloned() );
my_tx.send(partial_sum).unwrap();
});
offset += chunksize;
}
}
// We can close this Sender
drop(tx);
// Iterator magic! Yay! global_sum does not need to be mutable
let global_sum = sum(rx.iter());
println!("global sum via threads : {}", global_sum);
println!("global sum single-threaded: {}", sum(a.iter().cloned()));
}

Using a crate like crossbeam you can write this code:
use crossbeam; // 0.7.3
use rand::distributions::{Distribution, Uniform}; // 0.7.3
const NTHREADS: usize = 4;
fn random_vec(length: usize) -> Vec<i32> {
let step = Uniform::new_inclusive(1, 100);
let mut rng = rand::thread_rng();
step.sample_iter(&mut rng).take(length).collect()
}
fn main() {
let numbers = random_vec(10);
let num_tasks_per_thread = numbers.len() / NTHREADS;
crossbeam::scope(|scope| {
// The `collect` is important to eagerly start the threads!
let threads: Vec<_> = numbers
.chunks(num_tasks_per_thread)
.map(|chunk| scope.spawn(move |_| chunk.iter().cloned().sum::<i32>()))
.collect();
let thread_sum: i32 = threads.into_iter().map(|t| t.join().unwrap()).sum();
let no_thread_sum: i32 = numbers.iter().cloned().sum();
println!("global sum via threads : {}", thread_sum);
println!("global sum single-threaded: {}", no_thread_sum);
})
.unwrap();
}
Scoped threads allow you to pass in a reference that is guaranteed to outlive the thread. You can then use the return value of the thread directly, skipping channels (which are great, just not needed here!).
I followed How can I generate a random number within a range in Rust? to generate the random numbers. I also changed it to be the range [1,100], as I think that's what you meant. However, your original code is actually [-98,100], which you could also do.
Iterator::sum is used to sum up an iterator of numbers.
I threw in some rough performance numbers of the thread work, ignoring the vector construction, working on 100,000,000 numbers, using Rust 1.34 and compiling in release mode:
| threads | time (ns) | relative time (%) |
|---------+-----------+-------------------|
| 1 | 33824667 | 100.00 |
| 2 | 16246549 | 48.03 |
| 3 | 16709280 | 49.40 |
| 4 | 14263326 | 42.17 |
| 5 | 14977901 | 44.28 |
| 6 | 12974001 | 38.36 |
| 7 | 13321743 | 39.38 |
| 8 | 13370793 | 39.53 |
See also:
How can I pass a reference to a stack variable to a thread?

All your tasks get a copy of the scale variable. Thread 1 and 2 both do the same thing since each has scale with a value of 0 and modifies it in the same manner as the other thread.
The same goes for Thread 3 and 4.
Rust prevents you from breaking thread safety. If scale were shared by the threads, you would have race conditions when accessing the variable.
Please read about closures, they explain the variable copying part, and about threading which explains when and how you can share variables between threads.

Related

Rust - Use threads to iterate over a vector

My program creates a grid of numbers, then based on the sum of the numbers surrounding each on the grid the number will change in a set way. I'm using two vectors currently, filling the first with random numbers, calculating the changes, then putting the new values in the second vector. After the new values go into the second vector I then push them back into the first vector before going through the next loop. The error I get currently is:
error[E0499]: cannot borrow `grid_a` as mutable more than once at a time
--> src\main.rs:40:29
|
38 | thread::scope(|s| {
| - has type `&crossbeam::thread::Scope<'1>`
39 | for j in 0..grid_b.len() {
40 | s.spawn(|_| {
| - ^^^ `grid_a` was mutably borrowed here in the previous iteration of the loop
| _____________________|
| |
41 | | grid_a[j] = grid_b[j];
| | ------ borrows occur due to use of `grid_a` in closure
42 | | });
| |______________________- argument requires that `grid_a` is borrowed for `'1`
My current code is below. I'm way more familiar with C++ and C#, in the process of trying to learn Rust for this assignment. If I remove the thread everything compiles and runs properly. I'm not understanding how to avoid the multiple borrow. Ideally I'd like to use a separate thread::scope on with the for loop above the existing thread::scope as well.
use crossbeam::thread;
use rand::Rng;
use std::sync::{Arc, Mutex};
use std::thread::sleep;
use std::time::Duration;
use std::time::Instant;
static NUMROWS: i32 = 4;
static NUMCOLUMNS: i32 = 7;
static GRIDSIZE: i32 = NUMROWS * NUMCOLUMNS;
static PLUSNC: i32 = NUMCOLUMNS + 1;
static MINUSNC: i32 = NUMCOLUMNS - 1;
static NUMLOOP: i32 = 7;
static HIGH: u32 = 35;
fn main() {
let start = Instant::now();
let length = usize::try_from(GRIDSIZE).unwrap();
let total_checks = Arc::new(Mutex::new(0));
let mut grid_a = Vec::<u32>::with_capacity(length);
let mut grid_b = Vec::<u32>::with_capacity(length);
grid_a = fill_grid();
for h in 1..=NUMLOOP {
println!("-- {} --", h);
print_grid(&grid_a);
if h != NUMLOOP {
for i in 0..grid_a.len() {
let mut total_checks = total_checks.lock().unwrap();
grid_b[i] = checker(&grid_a, i.try_into().unwrap());
*total_checks += 1;
}
grid_a.clear();
thread::scope(|s| {
for j in 0..grid_b.len() {
s.spawn(|_| {
grid_a[j] = grid_b[j];
});
}
})
.unwrap();
grid_b.clear();
}
}
When you access a vector (or any slice) via index you're borrowing the whole vector.
You can use iterators which can give you mutable references to all the items in parallel.
use crossbeam::thread;
static NUMROWS: i32 = 4;
static NUMCOLUMNS: i32 = 7;
static GRIDSIZE: i32 = NUMROWS * NUMCOLUMNS;
static NUMLOOP: i32 = 7;
fn fill_grid() -> Vec<u32> {
(0..GRIDSIZE as u32).into_iter().collect()
}
fn main() {
let length = usize::try_from(GRIDSIZE).unwrap();
// just do this since else we create a vec and throw it away immediately
let mut grid_a = fill_grid();
let mut grid_b = Vec::<u32>::with_capacity(length);
for h in 1..=NUMLOOP {
println!("-- {} --", h);
println!("{grid_a:?}");
if h != NUMLOOP {
// removed a bunch of unrelated code
for i in 0..grid_a.len() {
grid_b.push(grid_a[i]);
}
// since we overwrite grid_a anyways we don't need to clear it.
// it would give us headaches anyways since grid_a[j] on an empty
// Vec panics.
// grid_a.clear();
thread::scope(|s| {
// instead of accessing the element via index we iterate over
// mutable references so we don't have to borrow the whole
// vector inside the thread
for (pa, b) in grid_a.iter_mut().zip(grid_b.iter().copied()) {
s.spawn(move |_| {
*pa = b + 1;
});
}
})
.unwrap();
grid_b.clear();
}
}
} // add missing }

Sharing arrays between threads in Rust

I'm new to Rust and I'm struggling with some ownership semantics.
The goal is to do some nonsense measurements on multiplying 2 f64 arrays and writing the result in a third array.
In the single-threaded version, a single thread takes care of the whole range. In the multi-threaded version, each thread takes care of a segment of the range.
The single-threaded version is easy, but my problem is with the multithreaded version where I'm struggling with the ownership rules.
I was thinking to use raw pointers, to bypass the borrow checker. But I'm still not able to make it pass.
#![feature(box_syntax)]
use std::time::SystemTime;
use rand::Rng;
use std::thread;
fn main() {
let nCells = 1_000_000;
let concurrency = 1;
let mut one = box [0f64; 1_000_000];
let mut two = box [0f64; 1_000_000];
let mut res = box [0f64; 1_000_000];
println!("Creating data");
let mut rng = rand::thread_rng();
for i in 0..nCells {
one[i] = rng.gen::<f64>();
two[i] = rng.gen::<f64>();
res[i] = 0 as f64;
}
println!("Finished creating data");
let rounds = 100000;
let start = SystemTime::now();
let one_raw = Box::into_raw(one);
let two_raw = Box::into_raw(two);
let res_raw = Box::into_raw(res);
let mut handlers = Vec::new();
for _ in 0..rounds {
let sizePerJob = nCells / concurrency;
for j in 0..concurrency {
let from = j * sizePerJob;
let to = (j + 1) * sizePerJob;
handlers.push(thread::spawn(|| {
unsafe {
unsafe {
processData(one_raw, two_raw, res_raw, from, to);
}
}
}));
}
for j in 0..concurrency {
handlers.get_mut(j).unwrap().join();
}
handlers.clear();
}
let durationUs = SystemTime::now().duration_since(start).unwrap().as_micros();
let durationPerRound = durationUs / rounds;
println!("duration per round {} us", durationPerRound);
}
// Make sure we can find the function in the generated Assembly
#[inline(never)]
pub fn processData(one: *const [f64;1000000],
two: *const [f64;1000000],
res: *mut [f64;1000000],
from: usize,
to: usize) {
unsafe {
for i in from..to {
(*res)[i] = (*one)[i] * (*two)[i];
}
}
}
This is the error I'm getting
error[E0277]: `*mut [f64; 1000000]` cannot be shared between threads safely
--> src/main.rs:38:27
|
38 | handlers.push(thread::spawn(|| {
| ^^^^^^^^^^^^^ `*mut [f64; 1000000]` cannot be shared between threads safely
|
= help: the trait `Sync` is not implemented for `*mut [f64; 1000000]`
= note: required because of the requirements on the impl of `Send` for `&*mut [f64; 1000000]`
note: required because it's used within this closure
--> src/main.rs:38:41
|
38 | handlers.push(thread::spawn(|| {
| ^^
note: required by a bound in `spawn`
--> /home/pveentjer/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/std/src/thread/mod.rs:653:8
|
653 | F: Send + 'static,
| ^^^^ required by this bound in `spawn`
[edit] I know that spawning threads is very expensive. I'll convert this to a pool of worker threads that can be recycled once this code is up and running.
You can use chunks_mut or split_at_mut to get non-overlapping slices of one two and res. You can then access different slices from different threads safely. See: documentation for chunks_mut and documentation for split_at_mut
I was able to compile it using scoped threads and chunks_mut. I have removed all the unsafe stuff because there is no need. See the code:
#![feature(box_syntax)]
#![feature(scoped_threads)]
use rand::Rng;
use std::thread;
use std::time::SystemTime;
fn main() {
let nCells = 1_000_000;
let concurrency = 2;
let mut one = box [0f64; 1_000_000];
let mut two = box [0f64; 1_000_000];
let mut res = box [0f64; 1_000_000];
println!("Creating data");
let mut rng = rand::thread_rng();
for i in 0..nCells {
one[i] = rng.gen::<f64>();
two[i] = rng.gen::<f64>();
res[i] = 0 as f64;
}
println!("Finished creating data");
let rounds = 1000;
let start = SystemTime::now();
for _ in 0..rounds {
let size_per_job = nCells / concurrency;
thread::scope(|s| {
for it in one
.chunks_mut(size_per_job)
.zip(two.chunks_mut(size_per_job))
.zip(res.chunks_mut(size_per_job))
{
let ((one, two), res) = it;
s.spawn(|| {
processData(one, two, res);
});
}
});
}
let durationUs = SystemTime::now().duration_since(start).unwrap().as_micros();
let durationPerRound = durationUs / rounds;
println!("duration per round {} us", durationPerRound);
}
// Make sure we can find the function in the generated Assembly
#[inline(never)]
pub fn processData(one: &[f64], two: &[f64], res: &mut [f64]) {
for i in 0..one.len() {
res[i] = one[i] * two[i];
}
}

How to give each CPU core mutable access to a portion of a Vec? [duplicate]

This question already has an answer here:
How do I pass disjoint slices from a vector to different threads?
(1 answer)
Closed 4 years ago.
I've got an embarrassingly parallel bit of graphics rendering code that I would like to run across my CPU cores. I've coded up a test case (the function computed is nonsense) to explore how I might parallelize it. I'd like to code this using std Rust in order to learn about using std::thread. But, I don't understand how to give each thread a portion of the framebuffer. I'll put the full testcase code below, but I'll try to break it down first.
The sequential form is super simple:
let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
for j in 0..HEIGHT {
for i in 0..WIDTH {
buffer0[j][i] = compute(i as i32,j as i32);
}
}
I thought that it would help to make a buffer that was the same size, but re-arranged to be 3D & indexed by core first. This is the same computation, just a reordering of the data to show the workings.
let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
for c in 0..num_logical_cores {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
buffer1[c][y][i] = compute(i as i32,j as i32)
}
}
}
But, when I try to put the inner part of the code in a closure & create a thread, I get errors about the buffer & lifetimes. I basically don't understand what to do & could use some guidance. I want per_core_buffer to just temporarily refer to the data in buffer2 that belongs to that core & allow it to be written, synchronize all the threads & then read buffer2 afterwards. Is this possible?
let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let mut handles = Vec::new();
for c in 0..num_logical_cores {
let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
let handle = thread::spawn(move || {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
per_core_buffer[y][i] = compute(i as i32,j as i32)
}
}
});
handles.push(handle)
}
for handle in handles {
handle.join().unwrap();
}
The error is this & I don't understand:
error[E0597]: `buffer2` does not live long enough
--> src/main.rs:50:36
|
50 | let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
| ^^^^^^^ borrowed value does not live long enough
...
88 | }
| - borrowed value only lives until here
|
= note: borrowed value must be valid for the static lifetime...
The full testcase is:
extern crate num_cpus;
use std::time::Instant;
use std::thread;
fn compute(x: i32, y: i32) -> i32 {
(x*y) % (x+y+10000)
}
fn main() {
let num_logical_cores = num_cpus::get();
const WIDTH: usize = 40000;
const HEIGHT: usize = 10000;
let y_per_core = HEIGHT/num_logical_cores + 1;
// ------------------------------------------------------------
// Serial Calculation...
let mut buffer0 = vec![vec![0i32; WIDTH]; HEIGHT];
let start0 = Instant::now();
for j in 0..HEIGHT {
for i in 0..WIDTH {
buffer0[j][i] = compute(i as i32,j as i32);
}
}
let dur0 = start0.elapsed();
// ------------------------------------------------------------
// On the way to Parallel Calculation...
// Reorder the data buffer to be 3D with one 2D region per core.
let mut buffer1 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let start1 = Instant::now();
for c in 0..num_logical_cores {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
buffer1[c][y][i] = compute(i as i32,j as i32)
}
}
}
let dur1 = start1.elapsed();
// ------------------------------------------------------------
// Actual Parallel Calculation...
let mut buffer2 = vec![vec![vec![0i32; WIDTH]; y_per_core]; num_logical_cores];
let mut handles = Vec::new();
let start2 = Instant::now();
for c in 0..num_logical_cores {
let per_core_buffer = &mut buffer2[c]; // <<< lifetime error
let handle = thread::spawn(move || {
for y in 0..y_per_core {
let j = y*num_logical_cores + c;
if j >= HEIGHT {
break;
}
for i in 0..WIDTH {
per_core_buffer[y][i] = compute(i as i32,j as i32)
}
}
});
handles.push(handle)
}
for handle in handles {
handle.join().unwrap();
}
let dur2 = start2.elapsed();
println!("Runtime: Serial={0:.3}ms, AlmostParallel={1:.3}ms, Parallel={2:.3}ms",
1000.*dur0.as_secs() as f64 + 1e-6*(dur0.subsec_nanos() as f64),
1000.*dur1.as_secs() as f64 + 1e-6*(dur1.subsec_nanos() as f64),
1000.*dur2.as_secs() as f64 + 1e-6*(dur2.subsec_nanos() as f64));
// Sanity check
for j in 0..HEIGHT {
let c = j % num_logical_cores;
let y = j / num_logical_cores;
for i in 0..WIDTH {
if buffer0[j][i] != buffer1[c][y][i] {
println!("wtf1? {0} {1} {2} {3}",i,j,buffer0[j][i],buffer1[c][y][i])
}
if buffer0[j][i] != buffer2[c][y][i] {
println!("wtf2? {0} {1} {2} {3}",i,j,buffer0[j][i],buffer2[c][y][i])
}
}
}
}
Thanks to #Shepmaster for the pointers and clarification that this is not an easy problem for Rust, and that I needed to consider crates to find a reasonable solution. I'm only just starting out in Rust, so this really wasn't clear to me.
I liked the ability to control the number of threads that scoped_threadpool gives, so I went with that. Translating my code from above directly, I tried to use the 4D buffer with core as the most-significant-index and that ran into troubles because that 3D vector does not implement the Copy trait. The fact that it implements Copy makes me concerned about performance, but I went back to the original problem and implemented it more directly & found a reasonable speedup by making each row a thread. Copying each row will not be a large memory overhead.
The code that works for me is:
let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
let mut pool = Pool::new(num_logical_cores as u32);
pool.scoped(|scope| {
let mut y = 0;
for e in &mut buffer2 {
scope.execute(move || {
for x in 0..WIDTH {
(*e)[x] = compute(x as i32,y as i32);
}
});
y += 1;
}
});
On a 6 core, 12 thread i7-8700K for 400000x4000 testcase this runs in 3.2 seconds serially & 481ms in parallel--a reasonable speedup.
EDIT: I continued to think about this issue and got a suggestion from Rustlang on twitter that I should consider rayon. I converted my code to rayon and got similar speedup with the following code.
let mut buffer2 = vec![vec![0i32; WIDTH]; HEIGHT];
buffer2
.par_iter_mut()
.enumerate()
.map(|(y,e): (usize, &mut Vec<i32>)| {
for x in 0..WIDTH {
(*e)[x] = compute(x as i32,y as i32);
}
})
.collect::<Vec<_>>();

Is it normal to experience large overhead using the 1:1 threading that comes in the standard library?

While working through learning Rust, a friend asked me to see what kind of performance I could get out of Rust for generating the first 1 million prime numbers both single-threaded and multi-threaded. After trying several implementations, I'm just stumped. Here is the kind of performance that I'm seeing:
rust_primes --threads 8 --verbose --count 1000000
Options { verbose: true, count: 1000000, threads: 8 }
Non-concurrent using while (15485863): 2.814 seconds.
Concurrent using mutexes (15485863): 876.561 seconds.
Concurrent using channels (15485863): 798.217 seconds.
Without overloading the question with too much code, here are the methods responsible for each of the benchmarks:
fn non_concurrent(options: &Options) {
let mut count = 0;
let mut current = 0;
let ts = Instant::now();
while count < options.count {
if is_prime(current) {
count += 1;
}
current += 1;
}
let d = ts.elapsed();
println!("Non-concurrent using while ({}): {}.{} seconds.", current - 1, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_mutex(options: &Options) {
let count = Arc::new(Mutex::new(0));
let highest = Arc::new(Mutex::new(0));
let mut cc = 0;
let mut current = 0;
let ts = Instant::now();
while cc < options.count {
let mut handles = vec![];
for x in current..(current + options.threads) {
let count = Arc::clone(&count);
let highest = Arc::clone(&highest);
let handle = thread::spawn(move || {
if is_prime(x) {
let mut c = count.lock().unwrap();
let mut h = highest.lock().unwrap();
*c += 1;
if x > *h {
*h = x;
}
}
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
cc = *count.lock().unwrap();
current += options.threads;
}
let d = ts.elapsed();
println!("Concurrent using mutexes ({}): {}.{} seconds.", *highest.lock().unwrap(), d.as_secs(), d.subsec_nanos() / 1_000_000);
}
fn concurrent_channel(options: &Options) {
let mut count = 0;
let mut current = 0;
let mut highest = 0;
let ts = Instant::now();
while count < options.count {
let (tx, rx) = mpsc::channel();
for x in current..(current + options.threads) {
let txc = mpsc::Sender::clone(&tx);
thread::spawn(move || {
if is_prime(x) {
txc.send(x).unwrap();
}
});
}
drop(tx);
for message in rx {
count += 1;
if message > highest && count <= options.count {
highest = message;
}
}
current += options.threads;
}
let d = ts.elapsed();
println!("Concurrent using channels ({}): {}.{} seconds.", highest, d.as_secs(), d.subsec_nanos() / 1_000_000);
}
Am I doing something wrong, or is this normal performance with the 1:1 threading that comes in the standard library?
Here is a MCVE that shows the same problem. I didn't limit the number of threads it starts up at once here like I did in the code above. The point is, threading seems to have a very significant overhead unless I'm doing something horribly wrong.
use std::thread;
use std::time::Instant;
use std::sync::{Mutex, Arc};
use std::time::Duration;
fn main() {
let iterations = 100_000;
non_threaded(iterations);
threaded(iterations);
}
fn threaded(iterations: u32) {
let tx = Instant::now();
let counter = Arc::new(Mutex::new(0));
let mut handles = vec![];
for _ in 0..iterations {
let counter = Arc::clone(&counter);
let handle = thread::spawn(move || {
let mut num = counter.lock().unwrap();
*num = test(*num);
});
handles.push(handle);
}
for handle in handles {
handle.join().unwrap();
}
let d = tx.elapsed();
println!("Threaded in {}.", dur_to_string(d));
}
fn non_threaded(iterations: u32) {
let tx = Instant::now();
let mut _q = 0;
for x in 0..iterations {
_q = test(x + 1);
}
let d = tx.elapsed();
println!("Non-threaded in {}.", dur_to_string(d));
}
fn dur_to_string(d: Duration) -> String {
let mut s = d.as_secs().to_string();
s.push_str(".");
s.push_str(&(d.subsec_nanos() / 1_000_000).to_string());
s
}
fn test(x: u32) -> u32 {
x
}
Here are the results of this on my machine:
Non-threaded in 0.9.
Threaded in 5.785.
threading seems to have a very significant overhead
It's not the general concept of "threading", it's the concept of creating and destroying lots of threads.
By default in Rust 1.22.1, each spawned thread allocates 2MiB of memory to use as stack space. In the worst case, your MCVE could allocate ~200GiB of RAM. In reality, this is unlikely to happen as some threads will exit, memory will be reused, etc. I only saw it use ~400MiB.
On top of that, there is overhead involved with inter-thread communication (Mutex, channels, Atomic*) compared to intra-thread variables. Some kind of locking needs to be performed to ensure that all threads see the same data. "Embarrassingly parallel" algorithms tend to not have a lot of communication required. There are also different amounts of time required for different communication primitives. Atomic variables tend to be faster than others in many cases, but aren't as widely usable.
Then there's compiler optimizations to account for. Non-threaded code is way easier to optimize compared to threaded code. For example, running your code in release mode shows:
Non-threaded in 0.0.
Threaded in 142.775.
That's right, the non-threaded code took no time. The compiler can see through the code and realizes that nothing actually happens and removes it all. I don't know how you got 5 seconds for the threaded code as opposed to the 2+ minutes I saw.
Switching to a threadpool will reduce a lot of the unneeded creation of threads. We can also use a threadpool that provides scoped threads, which allows us to avoid the Arc as well:
extern crate scoped_threadpool;
use scoped_threadpool::Pool;
fn threaded(iterations: u32) {
let tx = Instant::now();
let counter = Mutex::new(0);
let mut pool = Pool::new(8);
pool.scoped(|scope| {
for _ in 0..iterations {
scope.execute(|| {
let mut num = counter.lock().unwrap();
*num = test(*num);
});
}
});
let d = tx.elapsed();
println!("Threaded in {}.", dur_to_string(d));
}
Non-threaded in 0.0.
Threaded in 0.675.
As with most pieces of programming, it's crucial to understand the tools you have and to use them appropriately.

How do I use a Condvar to limit multithreading?

I'm trying to use a Condvar to limit the number of threads that are active at any given time. I'm having a hard time finding good examples on how to use Condvar. So far I have:
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
fn main() {
let thread_count_arc = Arc::new((Mutex::new(0), Condvar::new()));
let mut i = 0;
while i < 100 {
let thread_count = thread_count_arc.clone();
thread::spawn(move || {
let &(ref num, ref cvar) = &*thread_count;
{
let mut start = num.lock().unwrap();
if *start >= 20 {
cvar.wait(start);
}
*start += 1;
}
println!("hello");
cvar.notify_one();
});
i += 1;
}
}
The compiler error given is:
error[E0382]: use of moved value: `start`
--> src/main.rs:16:18
|
14 | cvar.wait(start);
| ----- value moved here
15 | }
16 | *start += 1;
| ^^^^^ value used here after move
|
= note: move occurs because `start` has type `std::sync::MutexGuard<'_, i32>`, which does not implement the `Copy` trait
I'm entirely unsure if my use of Condvar is correct. I tried staying as close as I could to the example on the Rust API. Wwat is the proper way to implement this?
Here's a version that compiles:
use std::{
sync::{Arc, Condvar, Mutex},
thread,
};
fn main() {
let thread_count_arc = Arc::new((Mutex::new(0u8), Condvar::new()));
let mut i = 0;
while i < 100 {
let thread_count = thread_count_arc.clone();
thread::spawn(move || {
let (num, cvar) = &*thread_count;
let mut start = cvar
.wait_while(num.lock().unwrap(), |start| *start >= 20)
.unwrap();
// Before Rust 1.42, use this:
//
// let mut start = num.lock().unwrap();
// while *start >= 20 {
// start = cvar.wait(start).unwrap()
// }
*start += 1;
println!("hello");
cvar.notify_one();
});
i += 1;
}
}
The important part can be seen from the signature of Condvar::wait_while or Condvar::wait:
pub fn wait_while<'a, T, F>(
&self,
guard: MutexGuard<'a, T>,
condition: F
) -> LockResult<MutexGuard<'a, T>>
where
F: FnMut(&mut T) -> bool,
pub fn wait<'a, T>(
&self,
guard: MutexGuard<'a, T>
) -> LockResult<MutexGuard<'a, T>>
This says that wait_while / wait consumes the guard, which is why you get the error you did - you no longer own start, so you can't call any methods on it!
These functions are doing a great job of reflecting how Condvars work - you give up the lock on the Mutex (represented by start) for a while, and when the function returns you get the lock again.
The fix is to give up the lock and then grab the lock guard return value from wait_while / wait. I've also switched from an if to a while, as encouraged by huon.
For reference, the usual way to have a limited number of threads in a given scope is with a Semaphore.
Unfortunately, Semaphore was never stabilized, was deprecated in Rust 1.8 and was removed in Rust 1.9. There are crates available that add semaphores on top of other concurrency primitives.
let sema = Arc::new(Semaphore::new(20));
for i in 0..100 {
let sema = sema.clone();
thread::spawn(move || {
let _guard = sema.acquire();
println!("{}", i);
})
}
This isn't quite doing the same thing: since each thread is not printing the total number of the threads inside the scope when that thread entered it.
I realized the code I provided didn't do exactly what I wanted it to, so I'm putting this edit of Shepmaster's code here for future reference.
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
fn main() {
let thread_count_arc = Arc::new((Mutex::new(0u8), Condvar::new()));
let mut i = 0;
while i < 150 {
let thread_count = thread_count_arc.clone();
thread::spawn(move || {
let x;
let &(ref num, ref cvar) = &*thread_count;
{
let start = num.lock().unwrap();
let mut start = if *start >= 20 {
cvar.wait(start).unwrap()
} else {
start
};
*start += 1;
x = *start;
}
println!("{}", x);
{
let mut counter = num.lock().unwrap();
*counter -= 1;
}
cvar.notify_one();
});
i += 1;
}
println!("done");
}
Running this in the playground should show more or less expected behavior.
You want to use a while loop, and re-assign start at each iteration, like:
fn main() {
let thread_count_arc = Arc::new((Mutex::new(0), Condvar::new()));
let mut i = 0;
while i < 100 {
let thread_count = thread_count_arc.clone();
thread::spawn(move || {
let &(ref num, ref cvar) = &*thread_count;
let mut start = num.lock().unwrap();
while *start >= 20 {
let current = cvar.wait(start).unwrap();
start = current;
}
*start += 1;
println!("hello");
cvar.notify_one();
});
i += 1;
}
}
See also some article on the topic:
https://medium.com/#polyglot_factotum/rust-concurrency-five-easy-pieces-871f1c62906a
https://medium.com/#polyglot_factotum/rust-concurrency-patterns-condvars-and-locks-e278f18db74f

Resources