Is rayon's parallelism limited to the cores of the machine?

I have the following toy Rust program:
use rayon::prelude::*;
use std::{env, thread, time};

/// Sleeps 1 second n times
fn seq_sleep(n: usize) {
    for _ in 0..n {
        thread::sleep(time::Duration::from_millis(1000));
    }
}

/// Launches n threads that each sleep 1 second
fn thread_sleep(n: usize) {
    let mut handles = Vec::new();
    for _ in 0..n {
        handles.push(thread::spawn(|| {
            thread::sleep(time::Duration::from_millis(1000))
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
}

/// Sleeps 1 second n times in parallel using rayon
fn rayon_sleep(n: usize) {
    let millis = vec![0; n];
    millis
        .par_iter()
        .for_each(|_| thread::sleep(time::Duration::from_millis(1000)));
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();

    let now = time::Instant::now();
    seq_sleep(n);
    println!("sequential: {:?}", now.elapsed());

    let now = time::Instant::now();
    thread_sleep(n);
    println!("thread: {:?}", now.elapsed());

    let now = time::Instant::now();
    rayon_sleep(n);
    println!("rayon: {:?}", now.elapsed());
}
Basically, I want to compare the degree of parallelism of i) sequential code, ii) basic threads, and iii) rayon. To do so, my program accepts one input parameter n and, depending on the method, it sleeps for 1 second n times.
For n = 8, I get the following output:
sequential: 8.016809707s
thread: 1.006029845s
rayon: 1.004957395s
So far so good. However, for n = 9, I get the following output:
sequential: 9.012422104s
thread: 1.003085005s
rayon: 2.011378713s
The sequential and basic thread versions make sense to me. However, I expected rayon to take 1 second. My machine has 4 cores with hyper-threading, which leads me to think that rayon internally limits the number of parallel threads according to the cores/threads your machine supports. Is this correct?

Yes:
rayon::ThreadPoolBuilder::build_global():
Initializes the global thread pool. This initialization is optional. If you do not call this function, the thread pool will be automatically initialized with the default configuration.
rayon::ThreadPoolBuilder::num_threads():
If num_threads is 0, or you do not call this function, then the Rayon runtime will select the number of threads automatically. At present, this is based on the RAYON_NUM_THREADS environment variable (if set), or the number of logical CPUs (otherwise). In the future, however, the default behavior may change to dynamically add or remove threads as needed.
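If you want more threads than logical CPUs (reasonable for a sleep-bound toy like this, although rayon's pool is really meant for CPU-bound work), you can size the global pool yourself before it is first used. A minimal sketch along the lines of your rayon_sleep:

use rayon::prelude::*;
use std::{thread, time};

fn main() {
    // Override the default pool size (the number of logical CPUs).
    // This must run before the global pool is first used.
    rayon::ThreadPoolBuilder::new()
        .num_threads(9)
        .build_global()
        .unwrap();

    let now = time::Instant::now();
    (0..9)
        .into_par_iter()
        .for_each(|_| thread::sleep(time::Duration::from_millis(1000)));
    // With 9 pool threads, all 9 sleeps run at once.
    println!("rayon with 9 threads: {:?}", now.elapsed());
}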

Related

Alternative to swapping vector elements in rust

I'm experimenting with Rust by porting some C++ code. I write a lot of code that uses vectors as object pools by moving elements to the back in various ways and then resizing. Here's a ported function:
use rand::{thread_rng, Rng};

fn main() {
    for n in 1..11 {
        let mut a: Vec<u8> = (1..11).collect();
        keep_n_rand(&mut a, n);
        println!("{}: {:?}", n, a);
    }
}

fn keep_n_rand<T>(x: &mut Vec<T>, n: usize) {
    let mut rng = thread_rng();
    for i in n..x.len() {
        let j = rng.gen_range(0..i);
        if j < n {
            x.swap(i, j);
        }
    }
    x.truncate(n);
}
It keeps n elements chosen at random. It is done this way because it does not reduce the capacity of the vector so that more objects can be added later without allocating (on average). This might be iterated millions of times.
In C++, I would use x[j] = std::move(x[i]); because I am about to truncate the vector. While it has no impact in this example, if the swap were expensive, it would make sense to move. Is that possible and desirable in Rust? I can live with a swap. I'm just curious.
Correct me if I'm wrong: you're looking for a way to retain n random elements in a Vec and discard the rest. In that case, the easiest way would be to use partial_shuffle(), a rand function implemented for slices.
Shuffle a slice in place, but exit early.
Returns two mutable slices from the source slice. The first contains amount elements randomly permuted. The second has the remaining elements that are not fully shuffled.
use rand::{thread_rng, seq::SliceRandom};

fn main() {
    let mut rng = thread_rng();
    // Use the `RangeInclusive` (`..=`) syntax at times like this.
    for n in 1..=10 {
        let mut elements: Vec<u8> = (1..=10).collect();
        let (elements, _rest) = elements.as_mut_slice().partial_shuffle(&mut rng, n);
        println!("{n}: {elements:?}");
    }
}
Run this snippet on Rust Playground.
elements is shadowed, going from a Vec to a &mut [T]. If you're only going to use it inside the function, that's probably all you'll need. However, since it's a reference, you can't return it; the data it's pointing to is owned by the original vector, which will be dropped when it goes out of scope. If that's what you need, you'll have to turn the slice into a Vec.
While you can simply construct a new one from it using Vec::from, I suspect (but haven't tested) that it's more efficient to use Vec::split_off.
Splits the collection into two at the given index.
Returns a newly allocated vector containing the elements in the range [at, len). After the call, the original vector will be left containing the elements [0, at) with its previous capacity unchanged.
use rand::{thread_rng, seq::SliceRandom};

fn main() {
    let mut rng = thread_rng();
    for n in 1..=10 {
        let mut elements: Vec<u8> = (1..=10).collect();
        elements.as_mut_slice().partial_shuffle(&mut rng, n);
        let elements = elements.split_off(elements.len() - n);
        // `elements` is still a `Vec`; this time, containing only
        // the shuffled elements. You can use it as the return value.
        println!("{n}: {elements:?}");
    }
}
Run this snippet on Rust Playground.
Since this function lives on a performance-critical path, I'd recommend benchmarking it against your current implementation. At the time of writing this, criterion is the most popular way to do that. That said, rand is an established library, so I imagine it will perform as well or better than a manual implementation.
Sample Benchmark
I don't know what kind of numbers you're working with, but here's a sample benchmark with for n in 1..=100 and (1..=100).collect() (i.e. 100 instead of 10 in both places) without the print statements:
manual time: [73.683 µs 73.749 µs 73.821 µs]
rand with slice time: [68.074 µs 68.147 µs 68.226 µs]
rand with vec time: [54.147 µs 54.213 µs 54.288 µs]
Bizarrely, splitting off a Vec performed vastly better than not. Unless I made an error in my benchmarks, the compiler is probably doing something under the hood that you'll need a more experienced Rustacean than me to explain.
Benchmark Implementation
Cargo.toml
[dependencies]
rand = "0.8.5"
[dev-dependencies]
criterion = "0.4.0"
[[bench]]
name = "rand_benchmark"
harness = false
[[bench]]
name = "rand_vec_benchmark"
harness = false
[[bench]]
name = "manual_benchmark"
harness = false
benches/manual_benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn manual_solution() {
    for n in 1..=100 {
        let mut elements: Vec<u8> = (1..=100).collect();
        keep_n_rand(&mut elements, n);
    }
}

fn keep_n_rand<T>(elements: &mut Vec<T>, n: usize) {
    use rand::{thread_rng, Rng};
    let mut rng = thread_rng();
    for i in n..elements.len() {
        let j = rng.gen_range(0..i);
        if j < n {
            elements.swap(i, j);
        }
    }
    elements.truncate(n);
}

fn benchmark(c: &mut Criterion) {
    c.bench_function("manual", |b| b.iter(manual_solution));
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
benches/rand_benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn rand_solution() {
    use rand::{seq::SliceRandom, thread_rng};
    let mut rng = thread_rng();
    for n in 1..=100 {
        let mut elements: Vec<u8> = (1..=100).collect();
        let (_elements, _) = elements.as_mut_slice().partial_shuffle(&mut rng, n);
    }
}

fn benchmark(c: &mut Criterion) {
    c.bench_function("rand with slice", |b| b.iter(rand_solution));
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
benches/rand_vec_benchmark.rs
use criterion::{criterion_group, criterion_main, Criterion};

fn rand_solution() {
    use rand::{seq::SliceRandom, thread_rng};
    let mut rng = thread_rng();
    for n in 1..=100 {
        let mut elements: Vec<u8> = (1..=100).collect();
        elements.as_mut_slice().partial_shuffle(&mut rng, n);
        let _elements = elements.split_off(elements.len() - n);
    }
}

fn benchmark(c: &mut Criterion) {
    c.bench_function("rand with vec", |b| b.iter(rand_solution));
}

criterion_group!(benches, benchmark);
criterion_main!(benches);
Is that possible and desirable in Rust?
It is not possible unless you constrain T: Copy or T: Clone: while C++ uses non-destructive moves (the source is left in a valid but unspecified state), Rust uses destructive moves (the source is gone).
There are ways around it using unsafe, but they require being very careful, and it's probably not worth the hassle (you can look at Vec::swap_remove for a taste; it basically does what you're doing here, except only between j and the last element of the vec).
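For reference, a small illustration of what swap_remove does; it removes an element in O(1) by moving the last element into the hole instead of shifting everything left:

fn main() {
    let mut v = vec![1, 2, 3, 4];
    // Remove index 1 without shifting: the last element (4) is
    // moved into its place.
    let removed = v.swap_remove(1);
    assert_eq!(removed, 2);
    assert_eq!(v, [1, 4, 3]);
}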
I'd also recommend verified_tinker's solution, as I'm not convinced your shuffle is unbiased.

CPU time sleep instead of wall-clock time sleep

Currently, I have the following Rust toy program:
use rayon::prelude::*;
use std::{env, thread, time};

/// Sleeps 1 second n times in parallel using rayon
fn rayon_sleep(n: usize) {
    let millis = vec![0; n];
    millis
        .par_iter()
        .for_each(|_| thread::sleep(time::Duration::from_millis(1000)));
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let now = time::Instant::now();
    rayon_sleep(n);
    println!("rayon: {:?}", now.elapsed());
}
Basically, my program accepts one input argument n. Then, I sleep for 1 second n times. The program executes the sleep tasks in parallel using rayon.
However, this is not exactly what I want. As far as I know, thread::sleep sleeps according to wall-clock time. However, I would like to keep a virtual CPU busy for 1 second in CPU time.
Is there any way to do this?
EDIT
I would like to make this point clear: I don't mind if the OS preempts the tasks. However, if this happens, then I don't want to consider the time the task spends in the ready/waiting queue.
EDIT
This is a simple, illustrative example of what I need to do. In reality, I have to develop a benchmark for a crate that allows defining and simulating models using the DEVS formalism. The benchmark aims to compare DEVS-compliant libraries with each other, and it explicitly says that the models must spend a fixed, known amount of CPU time. That is why I need to make sure of that. Thus, I cannot use a simple busy loop nor simply sleep.
I followed Sven Marnach's suggestions and implemented the following function:
use cpu_time::ThreadTime;
use rayon::prelude::*;
use std::{env, time};

/// Busy-waits for 1 second of CPU time, n times in parallel using rayon
fn rayon_sleep(n: usize) {
    let millis = vec![0; n];
    millis.par_iter().for_each(|_| {
        let duration = time::Duration::from_millis(1000);
        let mut x: u32 = 0;
        let now = ThreadTime::now(); // get current thread time
        while now.elapsed() < duration {
            // active sleep
            std::hint::black_box(&mut x); // to avoid compiler optimizations
            x = x.wrapping_add(1);
        }
    });
}

fn main() {
    let args: Vec<String> = env::args().collect();
    let n = args[1].parse::<usize>().unwrap();
    let now = time::Instant::now();
    rayon_sleep(n);
    println!("rayon: {:?}", now.elapsed());
}
If I set n to 8, it takes roughly 2 seconds. I'd expect better performance (1 second, as I have 8 vCPUs), but I guess the overhead comes from the OS scheduling policy.

How to have seedable RNG in parallel in rust

I am learning Rust by implementing a raytracer. I have a working prototype that is single-threaded, and I am trying to make it multithreaded.
In my code, I have a sampler, which is basically a wrapper around StdRng::seed_from_u64(123) (this will change when I add different types of samplers) that is mutable because of StdRng. I need repeatable behaviour; that is why I am seeding the random number generator.
In my rendering loop I use the sampler in the following way
let mut sampler = create_sampler(&self.sampler_value);
let sample_count = sampler.sample_count();

println!("Rendering ...");
let progress_bar = get_progress_bar(image.size());

// Generate multiple rays for each pixel in the image
for y in 0..image.size_y {
    for x in 0..image.size_x {
        image[(x, y)] = (0..sample_count)
            .into_iter()
            .map(|_| {
                let pixel = Vec2::new(x as f32, y as f32) + sampler.next2f();
                let ray = self.camera.generate_ray(&pixel);
                self.integrator.li(self, &mut sampler, &ray)
            })
            .sum::<Vec3>()
            / (sample_count as f32);
        progress_bar.inc(1);
    }
}
When I replace into_iter with into_par_iter, the compiler tells me: cannot borrow sampler as mutable, as it is a captured variable in a Fn closure.
What should I do in this situation?
Thanks!
P.s. If it is of any use, this is the repo : https://github.com/jgsimard/rustrt
Even if Rust wasn't stopping you, you cannot just use a seeded PRNG with parallelism and get a reproducible result out.
Think about it this way: a PRNG with a certain seed/state produces a certain sequence of numbers. Reproducibility (determinism) requires not just that the numbers are the same, but that the way they are taken from the sequence is the same. But if you have multiple threads computing different pixels (different uses) which are racing with each other to fetch numbers from the single PRNG, then the pixels will fetch different numbers on different runs.
In order to get the determinism you want, you must deterministically choose which random number is used for which purpose.
One way to do this would be to make up an “image” of random numbers, computed sequentially, and pass that to the parallel loop. Then each ray has its own random number, which it can use as its seed for another PRNG that only that ray uses.
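A minimal sketch of that idea, assuming rand's StdRng and leaving the rendering types out:

use rand::{Rng, SeedableRng};
use rand::rngs::StdRng;

fn main() {
    let (width, height) = (4usize, 3usize); // illustrative dimensions
    // Sequentially draw one seed per pixel from the master PRNG...
    let mut master = StdRng::seed_from_u64(123);
    let seeds: Vec<u64> = (0..width * height).map(|_| master.gen()).collect();
    // ...then each (possibly parallel) pixel task owns a PRNG derived
    // from its seed, so thread scheduling no longer matters.
    let mut pixel_rng = StdRng::seed_from_u64(seeds[0]);
    let _sample: f32 = pixel_rng.gen();
}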
Another way that can be much more efficient and usable (because it doesn't require any sequentiality at all) is to use hash functions instead of PRNGs. Whenever you want a random number, use a hash function (like those which implement the std::hash::Hasher trait in Rust, but not necessarily the particular one std provides since it's not the fastest) to combine a bunch of information, like
the seed value
the pixel x and y location
which bounce or secondary ray of this pixel you're computing
into a single value which you can use as a pseudorandom number. This way, the “random” results are the same for the same circumstances (because you explicitly specified that they should be computed from them), even if some other part of the program execution changes (whether that's a code change or a thread scheduling decision by the OS).
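A minimal sketch of the hash-based approach using std's DefaultHasher (the function name and inputs are illustrative; in practice you'd pick a faster hash):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a deterministic pseudorandom value from the seed plus
/// everything that identifies this particular sample.
fn sample_value(seed: u64, x: u32, y: u32, bounce: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    (seed, x, y, bounce).hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Same inputs, same output, no matter how threads are scheduled.
    assert_eq!(sample_value(123, 5, 7, 0), sample_value(123, 5, 7, 0));
}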
Your sampler is not thread-safe, if only because it is a &mut Sampler and mutable references cannot be shared between threads, obviously.
The easy thing would be to wrap it in an Arc<Mutex<Sampler>> and clone it for every closure. Something like (untested):
let sampler = Arc::new(Mutex::new(create_sampler(&self.sampler_value)));
//...
for y in 0..image.size_y {
    for x in 0..image.size_x {
        image[(x, y)] = (0..sample_count)
            .into_par_iter()
            .map({
                let sampler = Arc::clone(&sampler);
                move |_| {
                    let mut sampler = sampler.lock().unwrap();
                    // use the sampler
                }
            })
            .sum::<Vec3>() //...
But that may not be a very efficient way, because the mutex will be locked most of the time, and you will kill the parallelism. You could try locking/unlocking the mutex during the ray tracing and see if that improves things.
The ideal solution would be to make the Sampler thread-safe with interior mutability, so that next2f and friends do not need the &mut self part (Sampler::next2f(&self)). Again, the easiest way is an internal mutex.
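A minimal sketch of that shape (the field is illustrative, and Vec2 is replaced by a plain tuple here):

use rand::{rngs::StdRng, Rng, SeedableRng};
use std::sync::Mutex;

struct Sampler {
    // Interior mutability: callers only need &self.
    rng: Mutex<StdRng>,
}

impl Sampler {
    fn next2f(&self) -> (f32, f32) {
        let mut rng = self.rng.lock().unwrap();
        (rng.gen(), rng.gen())
    }
}

fn main() {
    let sampler = Sampler { rng: Mutex::new(StdRng::seed_from_u64(123)) };
    let (a, b) = sampler.next2f();
    println!("{a} {b}");
}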
Or you can try going lock-less! I mean, your current implementation of that function is:
fn next2f(&mut self) -> Vec2 {
    self.current_dimension += 2;
    Vec2::new(self.rng.gen(), self.rng.gen())
}
You could replace the current_dimension with an AtomicI32 and the rng with a rand::thread_rng (also untested):
fn next2f(&self) -> Vec2 {
    self.current_dimension.fetch_add(2, Ordering::SeqCst);
    let mut rng = rand::thread_rng();
    Vec2::new(rng.gen(), rng.gen())
}
This is the way I did it. I used this resource: rust-random.github.io/book/guide-parallel.html. I used ChaCha8Rng with the set_stream function to get a seedable PRNG in parallel. I had to move the image[(x, y)] assignment outside of the iterator because into_par_iter does not allow a mutable borrow inside the closure. If you see something dumb in my solution, please tell me!
let size_x = image.size_x;
let img: Vec<Vec<Vec3>> = (0..image.size_y)
    .into_par_iter()
    .map(|y| {
        (0..image.size_x)
            .into_par_iter()
            .map(|x| {
                let mut rng = ChaCha8Rng::seed_from_u64(sampler.seed());
                rng.set_stream((y * size_x + x) as u64);
                let v = (0..sample_count)
                    .into_iter()
                    .map(|_| {
                        let pixel = Vec2::new(x as f32, y as f32) + sampler.next2f(&mut rng);
                        let ray = self.camera.generate_ray(&pixel);
                        self.integrator.li(self, &sampler, &mut rng, &ray)
                    })
                    .sum::<Vec3>()
                    / (sample_count as f32);
                progress_bar.inc(1);
                v
            })
            .collect()
    })
    .collect();

for (y, row) in img.into_iter().enumerate() {
    for (x, p) in row.into_iter().enumerate() {
        image[(x, y)] = p;
    }
}

Getting HC-SR04 ultrasonic sensor data from an STM32F411 with Rust HAL yields a constant value independent of sensor condition

So I want to get the distance in cm from my sensor. I already did it with Arduino C and an Arduino-compatible board; now I want to do this with an STM32. Below is my code (leaving out the conversion of pulse length to distance via the speed of sound, as the delta time is already constant at this point).
#![deny(unsafe_code)]
#![allow(clippy::empty_loop)]
#![no_main]
#![no_std]

use panic_halt as _; // panic handler
use cortex_m_rt::{entry, interrupt};
use stm32f4xx_hal as hal;
use crate::hal::{pac, prelude::*};
use stm32f4xx_hal::delay::Delay;
use rtt_target::{rtt_init_print, rprintln};
use stm32f4xx_hal::timer::{Counter, Timer, SysCounter, CounterUs};
use cortex_m::peripheral::SYST;
use stm32f4xx_hal::time::Hertz;
use core::fmt::Debug;
use stm32f4xx_hal::pac::TIM2;
use core::pin::Pin;

fn dbg<T: Debug>(d: T, tag: &str) -> T {
    rprintln!("{} {:?}", tag, d);
    d
}

fn waste(c_us: &CounterUs<TIM2>, us: u32) {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us {}
}

fn waste_until<T>(c_us: &CounterUs<TIM2>,
                  predicate: fn(_: &T) -> bool,
                  dt: &T,
                  us: u32) -> u32 {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us && !predicate(dt) {}
    return c_us.now().ticks() - ts1;
}

#[entry]
fn main() -> ! {
    if let (Some(dp), Some(cp)) = (
        pac::Peripherals::take(),
        cortex_m::peripheral::Peripherals::take(),
    ) {
        rtt_init_print!();
        let gpioa = dp.GPIOA.split();
        let mut trig = gpioa.pa3.into_push_pull_output();
        let mut echo = gpioa.pa4.into_pull_up_input();
        let rcc = dp.RCC.constrain();
        let clocks = rcc.cfgr.freeze();
        let mut counter = Timer::new(dp.TIM2, &clocks).counter_us();
        counter.start(1_000_000_u32.micros()).unwrap();
        loop {
            trig.set_low();
            waste(&counter, 2);
            trig.set_high();
            waste(&counter, 10);
            trig.set_low();
            let _ = waste_until(&counter, |c| c.is_high(), &echo, 1000);
            let pulse_duration = waste_until(&counter, |c| c.is_low(), &echo, 1000);
            rprintln!("{}", pulse_duration);
        }
    }
    loop {}
}
I know that the code at this point does not stop evaluating the data in the case of a timeout in the waste_until function, but given that there is an object less than 10 cm from the sensor (which has a range of up to 2 meters), it shouldn't be causing issues.
I have a few things I don't understand completely, which I assume might be the cause of this behavior.
First of all, I'm not sure if hardware timers loop or have to be reset manually. (I used my waste function with a half-second delay and managed to make a seemingly OK blinky program, so I hope I got it correct.)
I'm not sure if I have to configure the TIM2 maximum sampling frequency; in theory I could do it with the sysclock, but I didn't find a way to do it with TIM2. Also, I assumed that it wouldn't let me create a CounterUs without a minimum valid sample rate.
I'm not sure if ticks() are in a one-to-one relation with microseconds (I only assumed so, because it seemed logical that CounterUs would do that).
I'm not sure about the problems that might occur if the timer loops mid-wait and the delta time becomes negative (in the case of u32, it just overflows).
When it comes to pull_up_input and pull_down_input, does pull_up refer to the fact that the pin is usually pulled high and has to go low to trigger a logical one, or that it has to be pulled high to get a logical one? (Also, it is not very clear whether the is_low() and is_high() methods refer to the state of the pin or the logical value of the pin.)
I spent quite some time on this, but sadly to no avail so far. Hopefully someone can tell me if one of the things above is wrong and indeed causes the issue, or point out something I haven't considered.
(The value I'm getting is 1000 to 1001.)
From one of the comments, I found out about pull-down and pull-up resistors and watched a couple of YouTube videos on the matter. Not sure if this is correct, but from what I've found, it seems that I in fact need a pull_down_input for the echo pin. So I replaced it, and the value I'm getting is still constant, but it's 1 now.
That makes some sense, since I assume the 1000 was originating from the timeout value in my waste. But getting 1 is a bit more confusing; I mean, it cannot be faster than 1 µs, right?
So after experimenting some more, I've ended up with this version of the code:
#![deny(unsafe_code)]
#![allow(clippy::empty_loop)]
#![no_main]
#![no_std]

use panic_halt as _; // panic handler
use cortex_m_rt::{entry, interrupt};
use stm32f4xx_hal as hal;
use crate::hal::{pac, prelude::*};
use stm32f4xx_hal::delay::Delay;
use rtt_target::{rtt_init_print, rprintln};
use stm32f4xx_hal::timer::{Counter, Timer, SysCounter, CounterUs};
use cortex_m::peripheral::SYST;
use stm32f4xx_hal::time::Hertz;
use core::fmt::Debug;
use stm32f4xx_hal::pac::TIM2;
use core::pin::Pin;
use cortex_m::asm::nop;

fn dbg<T: Debug>(d: T, tag: &str) -> T {
    rprintln!("{} {:?}", tag, d);
    d
}

fn waste(c_us: &CounterUs<TIM2>, us: u32) {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us {}
}

fn waste_until<T>(c_us: &CounterUs<TIM2>,
                  predicate: fn(_: &T) -> bool,
                  dt: &T,
                  us: u32) -> Option<u32> {
    let ts1 = c_us.now().ticks();
    while (c_us.now().ticks() - ts1) < us && !predicate(dt) {}
    if predicate(dt) { Some(c_us.now().ticks() - ts1) } else { None }
}

#[entry]
fn main() -> ! {
    if let (Some(dp), Some(cp)) = (
        pac::Peripherals::take(),
        cortex_m::peripheral::Peripherals::take(),
    ) {
        rtt_init_print!();
        let gpioa = dp.GPIOA.split();
        let mut trig = gpioa.pa4.into_push_pull_output();
        let mut echo = gpioa.pa5.into_pull_down_input();
        let rcc = dp.RCC.constrain();
        let clocks = rcc.cfgr.freeze();
        let mut counter = Timer::new(dp.TIM2, &clocks).counter_us();
        counter.start(1_000_000_u32.micros()).unwrap();
        loop {
            // starting pulse
            trig.set_low();
            waste(&counter, 2);
            trig.set_high();
            waste(&counter, 10);
            trig.set_low();
            // ending pulse
            // starting echo read
            if let Some(_) = waste_until(&counter, |c| c.is_high(), &echo, 1_000_000) { // if it didn't time out
                if let Some(pulse_duration) = waste_until(&counter, |c| c.is_low(), &echo, 1_000_000) { // if it didn't time out
                    rprintln!("{}", pulse_duration);
                } else {
                    rprintln!("no falling edge");
                }
            } else {
                rprintln!("no rising edge");
            }
            // end echo read
        }
    }
    loop {}
}
And here it became clear that the pattern, in fact, was that the first 1-3 readings output the same value (so far I've seen 1, 21, and 41), and then it keeps timing out in the outer if.
I tried changing I/O pins because I suspected my poor solder job was to blame, and I also inspected the pins with a multimeter; they seem to be fine.
I'm not entirely sure, but I think that given the sensor has a recommended VCC of 5 volts and the stlink-2 provides 3.3 volts to the board, the sensor can perform worse (but, once again, the target object is at most 5 cm away).
Here are the images of my breadboard, just in case I missed something.

Why does spawning threads using Iterator::map not run the threads in parallel?

I wrote a simple multithreaded application in Rust to add the numbers from 1 to x. (I know there is a formula for this, but the point was to write some multithreaded code in Rust, not to get the outcome.)
It worked fine, but after I refactored it from an imperative to a more functional style, there was no more speedup from multithreading. Inspecting the CPU usage, it appears that only one core of my 4-core/8-thread CPU is used. The original code has 790% CPU usage, while the refactored version only has 99%.
The original code:
use std::thread;

fn main() {
    let mut handles: Vec<thread::JoinHandle<u64>> = Vec::with_capacity(8);
    const thread_count: u64 = 8;
    const batch_size: u64 = 20000000;
    for thread_id in 0..thread_count {
        handles.push(thread::spawn(move || {
            let mut sum = 0_u64;
            for i in thread_id * batch_size + 1_u64..(thread_id + 1) * batch_size + 1_u64 {
                sum += i;
            }
            sum
        }));
    }
    let mut total_sum = 0_u64;
    for handle in handles.into_iter() {
        total_sum += handle.join().unwrap();
    }
    println!("{}", total_sum);
}
The refactored code:
use std::thread;

fn main() {
    const THREAD_COUNT: u64 = 8;
    const BATCH_SIZE: u64 = 20000000;
    // spawn threads that calculate a part of the sum
    let handles = (0..THREAD_COUNT).map(|thread_id| {
        thread::spawn(move ||
            // calculate the sum of all numbers assigned to this thread
            (thread_id * BATCH_SIZE + 1..(thread_id + 1) * BATCH_SIZE + 1)
                .fold(0_u64, |sum, number| sum + number))
    });
    // add the parts of the sum together to get the total sum
    let sum = handles.fold(0_u64, |sum, handle| sum + handle.join().unwrap());
    println!("{}", sum);
}
The outputs of the programs are the same (12800000080000000), but the refactored version is 5-6 times slower.
It appears that iterators are lazily evaluated. How can I force the entire iterator to be evaluated? I tried to collect it into an array of type [thread::JoinHandle<u64>; THREAD_COUNT as usize], but then I get the following error:
--> src/main.rs:14:7
|
14 | ).collect::<[thread::JoinHandle<u64>; THREAD_COUNT as usize]>();
| ^^^^^^^ a collection of type `[std::thread::JoinHandle<u64>; 8]` cannot be built from `std::iter::Iterator<Item=std::thread::JoinHandle<u64>>`
|
= help: the trait `std::iter::FromIterator<std::thread::JoinHandle<u64>>` is not implemented for `[std::thread::JoinHandle<u64>; 8]`
Collecting into a vector does work, but that seems like a weird solution, because the size is known up front. Is there a better way than using a vector?
Iterators in Rust are lazy, so your threads are not started until handles.fold tries to access the corresponding element of the iterator. Basically what happens is:
1. handles.fold tries to access the first element of the iterator.
2. The first thread is started.
3. handles.fold calls its closure, which calls handle.join() for the first thread.
4. handle.join waits for the first thread to finish.
5. handles.fold tries to access the second element of the iterator.
6. The second thread is started.
7. And so on.
You should collect the handles into a vector before folding the result:
let handles: Vec<_> = (0..THREAD_COUNT)
    .map(|thread_id| {
        thread::spawn(move ||
            // calculate the sum of all numbers assigned to this thread
            (thread_id * BATCH_SIZE + 1..(thread_id + 1) * BATCH_SIZE + 1)
                .fold(0_u64, |sum, number| sum + number))
    })
    .collect();
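If you'd rather avoid the Vec, since the size is known up front, a fixed-size array can be built eagerly with std::array::from_fn (stable since Rust 1.63). A minimal sketch:

use std::thread;

fn main() {
    const THREAD_COUNT: usize = 8;
    const BATCH_SIZE: u64 = 20_000_000;
    // from_fn builds the whole array eagerly, so every thread is
    // spawned before any join happens.
    let handles: [thread::JoinHandle<u64>; THREAD_COUNT] =
        std::array::from_fn(|thread_id| {
            let thread_id = thread_id as u64;
            thread::spawn(move || {
                (thread_id * BATCH_SIZE + 1..=(thread_id + 1) * BATCH_SIZE).sum()
            })
        });
    let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{}", total);
}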
Or you could use a crate like Rayon which provides parallel iterators.
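For comparison, a sketch of the same computation with rayon's parallel iterators (assuming rayon is added as a dependency):

use rayon::prelude::*;

fn main() {
    const THREAD_COUNT: u64 = 8;
    const BATCH_SIZE: u64 = 20_000_000;
    // rayon splits the range across its thread pool; no manual
    // spawning or joining is needed.
    let sum: u64 = (1..THREAD_COUNT * BATCH_SIZE + 1).into_par_iter().sum();
    println!("{}", sum);
}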
