I'm currently learning parallelization and am investigating why a test program I wrote does not scale well. I have a simple program that performs a CPU-bound computation for a fixed total number of iterations and spreads those iterations across the number of threads in the test (from 1 to 8). While I don't expect perfect scaling (8 threads being 8 times faster than 1 thread), the scaling I am seeing seems bad enough that I believe there must be something I am missing.
I'm assuming that there is either something wrong with my code or that there's some aspect of parallelization that I'm missing.
Things that I feel can be ruled out:
- The work being done uses only local variables, so I don't believe memory bandwidth or cache issues are a problem.
- I have tried this test with each thread pinned to a different core and did not see any improvement in performance.
Hardware:
- Lenovo T495
- Operating System: Fedora 32
- KDE Plasma Version: 5.18.5
- KDE Frameworks Version: 5.75.0
- Qt Version: 5.14.2
- Kernel Version: 5.11.13-100.fc32.x86_64
- OS Type: 64-bit
- Processors: 8 × AMD Ryzen 5 PRO 3500U w/ Radeon Vega Mobile Gfx
- Memory: 21.5 GiB of RAM
Here's the code I wrote:
```rust
use std::thread;
use std::time::Instant;

fn main() {
    let loops = 10_000_000_000;
    for threads in 1..=8 {
        // As threads are added to the test, evenly split the total number of
        // iterations across all threads, so that the 1-thread test can be
        // compared to the 4-thread test. For `threads` that are not divisors
        // of `loops`, some threads may run one more iteration than the
        // others, but that is 1 out of more than a billion per thread and
        // should have a negligible effect on the run time.
        n_threads(threads, loops / threads);
    }
}

/// Have `num_threads` threads each run a function that will
/// iterate a computation `loops` times.
fn n_threads(num_threads: usize, loops: usize) {
    let sw = Instant::now();
    let mut threads = Vec::new();
    for _ in 0..num_threads {
        let t = thread::spawn(move || {
            // Time the work from inside the thread, so spawning and joining
            // are excluded from the per-thread measurement.
            let sw = Instant::now();
            let v = work(loops);
            (v, sw.elapsed().as_millis())
        });
        threads.push(t);
    }

    let mut durations = vec![0u128; num_threads];
    let mut idx = 0;
    for t in threads.into_iter() {
        let (_, dur) = t.join().unwrap();
        durations[idx] = dur;
        idx += 1;
    }

    let time = sw.elapsed();
    let avg = durations.iter().sum::<u128>() as f64 / num_threads as f64;
    println!("{}, {}, {}", num_threads, time.as_millis(), avg);
}

fn work(loops: usize) -> f64 {
    let mut x = 0.5;
    for i in 0..loops {
        x += (i as f64 / 10000.).sin();
    }
    x
}
```
When I run my test, I get the following results:
| Threads | Time (ms) | Scale Factor |
| -------:| ---------:| ------------:|
| 1 | 1702 | 1 |
| 2 | 993 | 1.713997986 |
| 3 | 757 | 2.248348745 |
| 4 | 650 | 2.618461538 |
| 5 | 582 | 2.924398625 |
| 6 | 495 | 3.438383838 |
| 7 | 475 | 3.583157895 |
| 8 | 455 | 3.740659341 |
Here's a chart showing the change in time to run the test vs the number of threads for the computation:
Here's a chart showing the performance multiplier vs threads along with a perfect multiplier:
Updated Test with 10,000,000,000 Total Iterations Spread Across Threads
Per request for a test that takes longer, I've increased the number of iterations by 100x. I've also moved the timing inside each thread (and updated the code above):
| Threads | Avg In-Thread Time (ms) | Times Faster |
| -------:| -----------------------:| ------------:|
| 1 | 155564 | 1 |
| 2 | 79400.5 | 1.959232 |
| 3 | 57965 | 2.683757 |
| 4 | 47753.25 | 3.257663 |
| 5 | 42054.6 | 3.699096 |
| 6 | 40028.66667 | 3.886315 |
| 7 | 39479.28571 | 3.940396 |
| 8 | 37376.625 | 4.162067 |
Contrary to what CPU manufacturers would like you to believe, hyperthreading is not the same as having more physical cores. In particular, hyperthreading is only effective when the two threads on a core are executing different operations at any given time (the threads may be running the same algorithm, but HT only helps when, say, one thread is waiting on the cache while the other is executing).
In your case, you get a 3.25× performance increase for 4 threads on 4 physical cores, which is not completely unreasonable depending on the work and overall system load. When running more than 4 threads, you get threads that run on the same core, and must share the same FPU which can only do one operation at a time, explaining why you can't get much more than a 4× performance increase.
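As a quick sanity check of the core topology, one can compare logical and physical core counts; a minimal sketch, assuming the num_cpus crate:

```rust
// Prints logical vs. physical core counts (num_cpus crate assumed).
// On a Ryzen 5 3500U this should report 8 logical and 4 physical cores.
fn main() {
    println!("logical cores:  {}", num_cpus::get());
    println!("physical cores: {}", num_cpus::get_physical());
}
```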
Several factors muddy the measurement here:
- The original test was very short, and the wall-clock time includes spawning the threads.
- Real cores vs. SMT: the 8 processors are 4 physical cores, each running two hardware threads.
- Frequency scaling, power states, and core parking can all distort short benchmarks.
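For the pinning experiment mentioned in the question, here is one way to pin a worker thread per core; a sketch, assuming the core_affinity crate (not part of the original code):

```rust
use std::thread;

fn main() {
    // Query the available cores (core_affinity crate assumed).
    let core_ids = core_affinity::get_core_ids().expect("could not query cores");
    let handles: Vec<_> = core_ids
        .into_iter()
        .map(|id| {
            thread::spawn(move || {
                // Pin this thread to its core before doing the CPU-bound work.
                core_affinity::set_for_current(id);
                // ... run work(loops) here ...
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```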
Related
I'm trying to concurrently write to different parts of an image from different threads - my attempt at a multithreaded approach to rendering the Mandelbrot set. I know that each thread will be writing to different pixels (no overlap), so it is safe to do so. But I'm stuck on how to share the mutable image buffer across the threads without using a mutex (because that would defeat the purpose of the multithreading).
I am using the image crate for the ImageBuffer.
My current code is this:
```rust
/// Generates an RGB image of the specified fractal, with given
/// dimensions, a defined transformation from the image coordinate
/// plane to the complex plane, and max_iterations controlling the
/// amount of detail (50-100 being low, >=1000 being high, default 100).
pub fn generate_fractal_image(
    &self,
    fractal_type: FractalType,
    dimensions: (u32, u32),
    transform: &PlaneTransform<f64>,
    max_iterations: Option<u32>,
) -> RgbImage {
    let (width, height) = dimensions;
    let mut img = ImageBuffer::new(width, height);
    let transform = transform.clone();

    for (_, row) in img.enumerate_rows_mut() {
        self.thread_pool.assign_job(move || {
            for (_, (x, y, pixel)) in row.enumerate() {
                let rgb = match fractal_type {
                    FractalType::MandelbrotSet => mandelbrot::calculate_pixel(x, y, &transform, max_iterations),
                    FractalType::JuliaSet => julia::calculate_pixel(x, y, &transform, max_iterations),
                };
                *pixel = image::Rgb([rgb.r as u8, rgb.g as u8, rgb.b as u8]);
            }
        });
    }

    img
}
```
And I get these errors:
```text
error[E0597]: `img` does not live long enough
  --> src/lib.rs:38:20
   |
38 |     for (_, row) in img.enumerate_rows_mut() {
   |                     ^^^---------------------
   |                     |
   |                     borrowed value does not live long enough
   |                     argument requires that `img` is borrowed for `'static`
...
52 | }
   | - `img` dropped here while still borrowed

error[E0505]: cannot move out of `img` because it is borrowed
  --> src/lib.rs:51:4
   |
38 |     for (_, row) in img.enumerate_rows_mut() {
   |                     ------------------------
   |                     |
   |                     borrow of `img` occurs here
   |                     argument requires that `img` is borrowed for `'static`
...
51 |     img
   |     ^^^ move out of `img` occurs here
```
That makes sense, but I'm stuck on how to share `img` between the threads.
I'm using a thread pool implementation based on this chapter.
I'm still fairly new to Rust, so I'm definitely not doing things in the best or most "correct" way; anything you can point out would be brilliant :)
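One pattern that sidesteps the `'static` requirement entirely is scoped threads: threads spawned inside `std::thread::scope` (stable since Rust 1.63) are joined before the scope returns, so they may borrow `img` directly. A minimal sketch under that assumption, with a placeholder computation standing in for the fractal math:

```rust
use image::{ImageBuffer, Rgb, RgbImage};
use std::thread;

fn generate_image(width: u32, height: u32) -> RgbImage {
    let mut img: RgbImage = ImageBuffer::new(width, height);
    thread::scope(|s| {
        // Each row is a disjoint mutable borrow, so handing one row to each
        // scoped thread is safe. One thread per row is only illustrative;
        // a real renderer would batch rows or use a pool such as rayon's.
        for (_, row) in img.enumerate_rows_mut() {
            s.spawn(move || {
                for (x, y, pixel) in row {
                    // Placeholder for e.g. mandelbrot::calculate_pixel(x, y, ...).
                    *pixel = Rgb([(x % 256) as u8, (y % 256) as u8, 0]);
                }
            });
        }
    });
    img
}
```

A book-style thread pool requires `'static` jobs, which is exactly what the compiler is complaining about; scoped threads (or rayon's scoped pools) are the usual way around that.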
This question already has answers here:
Replace iter() with par_iter(): cannot borrow data mutably in a captured outer variable in an `Fn` closure (2 answers)
Closed 1 year ago.
This code:
```rust
use rayon::prelude::*; // 1.5.0

fn main() {
    let mut items = Vec::new();
    items.push("hello");
    items.push("foo");
    items.push("bar");
    items.push("ipsum");

    let mut counter = 0;
    let results = items.par_iter().map(|item| {
        // do something time consuming with item
        counter += 1;
        print!("completed {} items\r", counter);
        0
    });
}
```
Produces an error:
```text
warning: unused variable: `item`
  --> src/main.rs:12:41
   |
12 |     let results = items.par_iter().map(|item| {
   |                                         ^^^^ help: if this is intentional, prefix it with an underscore: `_item`
   |
   = note: `#[warn(unused_variables)]` on by default

warning: unused variable: `results`
  --> src/main.rs:12:9
   |
12 |     let results = items.par_iter().map(|item| {
   |         ^^^^^^^ help: if this is intentional, prefix it with an underscore: `_results`

error[E0594]: cannot assign to `counter`, as it is a captured variable in a `Fn` closure
  --> src/main.rs:14:9
   |
14 |         counter += 1;
   |         ^^^^^^^^^^^^ cannot assign
```
Rust is preventing a data race here: you would be writing to the same variable from two different threads. You have a couple of options for solving this; which is best really depends on the specific circumstances.
The simplest is to use a Mutex for counter. This gives you safe access to the same variable from all threads. Introducing the Mutex risks eating up all the speedup of the parallel iterator, since everything is serialized through the Mutex access. This can be acceptable if the runtime of the map body is large and the Mutex is only held briefly.
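A minimal sketch of that option, reusing the names from the question:

```rust
use rayon::prelude::*;
use std::sync::Mutex;

fn main() {
    let items = vec!["hello", "foo", "bar", "ipsum"];
    let counter = Mutex::new(0u32);

    let _results: Vec<i32> = items
        .par_iter()
        .map(|_item| {
            // Do something time consuming with _item, then bump the counter.
            // Every task locks here, so the updates are serialized.
            let mut count = counter.lock().unwrap();
            *count += 1;
            print!("completed {} items\r", *count);
            0
        })
        .collect();
}
```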
For the specific case of counters, atomic types such as AtomicI32 work well, but they are hard or impossible to use for more complex types.
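The same progress counter as a sketch with an atomic; `Relaxed` ordering suffices here because nothing else synchronizes on the counter:

```rust
use rayon::prelude::*;
use std::sync::atomic::{AtomicU32, Ordering};

fn main() {
    let items = vec!["hello", "foo", "bar", "ipsum"];
    let counter = AtomicU32::new(0);

    let _results: Vec<i32> = items
        .par_iter()
        .map(|_item| {
            // fetch_add returns the previous value, so add 1 for display.
            let done = counter.fetch_add(1, Ordering::Relaxed) + 1;
            print!("completed {} items\r", done);
            0
        })
        .collect();
}
```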
Instead of directly aggregating into a single variable, the work can be done multiple times in parallel and then merged together. This is what the reduce functions from rayon do. Each thread has at least one counter, and the counters are merged to produce a single final result.
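A sketch of that shape: each task yields a partial count and rayon merges the partials, so no shared mutable state is needed (note this produces a final total rather than live progress output):

```rust
use rayon::prelude::*;

fn main() {
    let items = vec!["hello", "foo", "bar", "ipsum"];

    let completed = items
        .par_iter()
        .map(|_item| {
            // Do something time consuming with _item, then report one unit done.
            1u32
        })
        .reduce(|| 0, |a, b| a + b);

    println!("completed {} items", completed);
}
```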
How do I write a lazily evaluated double for loop in a functional style in Rust?
The borrowed value is of type usize, which should be trivially copyable.
```rust
fn main() {
    let numbers: Vec<i32> = (1..100).collect();
    let len = numbers.len();
    let _sums_of_pairs: Vec<_> = (0..len)
        .map(|j| ((j + 1)..len).map(|k| numbers[j] + numbers[k]))
        .flatten()
        .collect();
}
```
```text
error[E0373]: closure may outlive the current function, but it borrows `j`, which is owned by the current function
 --> src/bin/example.rs:6:37
  |
6 |         .map(|j| ((j + 1)..len).map(|k| numbers[j] + numbers[k]))
  |                                     ^^^ - `j` is borrowed here
  |                                     |
  |                                     may outlive borrowed value `j`
  |
note: closure is returned here
 --> src/bin/example.rs:6:18
  |
6 |         .map(|j| ((j + 1)..len).map(|k| numbers[j] + numbers[k]))
  |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: to force the closure to take ownership of `j` (and any other referenced variables), use the `move` keyword
  |
6 |         .map(|j| ((j + 1)..len).map(move |k| numbers[j] + numbers[k]))
  |                                     ^^^^^^^^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0373`.
```
Further Notes
- I am aware that `Itertools::combinations(2)` does the job. However, I don't want to use it because (1) I want to know how to do this myself and (2) it might be the reason my code is slow, and I want to eliminate that possibility. (Update: `Itertools::tuple_combinations::<(_, _)>()` is much, much faster and lets one write this in a functional style; see the sketch after these notes.)
- I also tried collecting into a container first: `(0..len).collect::<Vec<_>>().iter().cloned().map(...)`.
- I tried the suggested `move`, but then `numbers` is also moved and hence not available in the next loop iteration.
- There is no threading or async happening anywhere in this code example.
- Shepmaster says in this answer that I cannot put lifetime annotations on closures.
- The reason I don't write two raw loops with an early return is that if I want to, say, run `.any()` to find whether a specific value is present, I cannot put `return true;` inside the loops unless they are moved into a separate function.
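For reference, a sketch of the `tuple_combinations` version mentioned in the update, assuming the itertools crate:

```rust
use itertools::Itertools;

fn main() {
    let numbers: Vec<i32> = (1..100).collect();
    // tuple_combinations yields every ordered pair (a, b) with a before b.
    let _sums_of_pairs: Vec<i32> = numbers
        .iter()
        .tuple_combinations::<(_, _)>()
        .map(|(a, b)| a + b)
        .collect();
}
```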
To work around the issue, you can borrow `&numbers` up front and shadow `numbers`; after that you can add `move` to the inner closure.
```rust
fn main() {
    let numbers: Vec<i32> = (1..100).collect();
    let len = numbers.len();
    let numbers = &numbers;
    let _sums_of_pairs: Vec<_> = (0..len)
        .map(|j| ((j + 1)..len).map(move |k| numbers[j] + numbers[k]))
        .flatten()
        .collect();
}
```
I would suggest writing more idiomatic Rust code with iterators instead of indexing. It is just simpler.
```rust
fn main() {
    let numbers: Vec<i32> = (1..100).collect();
    let _sums_of_pairs: Vec<_> = numbers
        .iter()
        .enumerate()
        .flat_map(|(j, &num)| numbers[j + 1..].iter().map(move |&k| k + num))
        .collect();
}
```
If you still want to use indexing (be aware that it is much less efficient), this would work:
```rust
fn main() {
    let numbers: Vec<i32> = (1..100).collect();
    let len = numbers.len();
    let _sums_of_pairs: Vec<_> = (0..len)
        .flat_map(|j| {
            let numbers = &numbers;
            ((j + 1)..len).map(move |k| numbers[j] + numbers[k])
        })
        .collect();
}
```
UPD:
Also, I wrote a benchmark to show that:
- My version is no slower than the one vallentin wrote; the difference is a matter of 1-2 microseconds.
- The Itertools version is not "100x slower" but only about 4/3 as slow.
The benchmark is available as a gist. Here are the results (itertools - using the itertools crate, indexing - the version from vallentin and the OP's preference, iterators - my iterator solution):
```text
itertools/100           time:   [12.781 us 12.805 us 12.831 us]
Found 13 outliers among 100 measurements (13.00%)
  13 (13.00%) low severe
itertools/200           time:   [48.211 us 48.693 us 49.071 us]

indexing/100            time:   [9.9299 us 9.9378 us 9.9467 us]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  12 (12.00%) high severe
indexing/200            time:   [39.582 us 39.654 us 39.720 us]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

iterators/100           time:   [9.7633 us 9.7809 us 9.8010 us]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
iterators/200           time:   [38.732 us 38.785 us 38.840 us]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
```
I wrote a simple multithreaded application in Rust to add the numbers from 1 to x. (I know there is a formula for this, but the point was to write some multithreaded code in Rust, not to get the outcome.)
It worked fine, but after I refactored it to a more functional style instead of an imperative one, there was no more speedup from multithreading. Inspecting CPU usage shows that only one core of my 4-core/8-thread CPU is used: the original code reaches 790% CPU usage, while the refactored version only reaches 99%.
The original code:
```rust
use std::thread;

fn main() {
    let mut handles: Vec<thread::JoinHandle<u64>> = Vec::with_capacity(8);
    const thread_count: u64 = 8;
    const batch_size: u64 = 20000000;

    for thread_id in 0..thread_count {
        handles.push(thread::spawn(move || {
            let mut sum = 0_u64;
            for i in thread_id * batch_size + 1_u64..(thread_id + 1) * batch_size + 1_u64 {
                sum += i;
            }
            sum
        }));
    }

    let mut total_sum = 0_u64;
    for handle in handles.into_iter() {
        total_sum += handle.join().unwrap();
    }

    println!("{}", total_sum);
}
```
The refactored code:
```rust
use std::thread;

fn main() {
    const THREAD_COUNT: u64 = 8;
    const BATCH_SIZE: u64 = 20000000;

    // spawn threads that each calculate a part of the sum
    let handles = (0..THREAD_COUNT).map(|thread_id| {
        thread::spawn(move ||
            // calculate the sum of all numbers assigned to this thread
            (thread_id * BATCH_SIZE + 1 .. (thread_id + 1) * BATCH_SIZE + 1)
                .fold(0_u64, |sum, number| sum + number))
    });

    // add the parts of the sum together to get the total sum
    let sum = handles.fold(0_u64, |sum, handle| sum + handle.join().unwrap());

    println!("{}", sum);
}
```
The outputs of the programs are the same (12800000080000000), but the refactored version is 5-6 times slower.
It appears that iterators are lazily evaluated. How can I force the entire iterator to be evaluated? I tried to collect it into an array of type [thread::JoinHandle<u64>; THREAD_COUNT as usize], but then I get the following error:
```text
  --> src/main.rs:14:7
   |
14 |     ).collect::<[thread::JoinHandle<u64>; THREAD_COUNT as usize]>();
   |       ^^^^^^^ a collection of type `[std::thread::JoinHandle<u64>; 8]` cannot be built from `std::iter::Iterator<Item=std::thread::JoinHandle<u64>>`
   |
   = help: the trait `std::iter::FromIterator<std::thread::JoinHandle<u64>>` is not implemented for `[std::thread::JoinHandle<u64>; 8]`
```
Collecting into a vector does work, but that seems like a weird solution because the size is known up front. Is there a better way than using a vector?
Iterators in Rust are lazy, so your threads are not started until handles.fold tries to access the corresponding element of the iterator. Basically what happens is:
1. handles.fold tries to access the first element of the iterator.
2. The first thread is started.
3. handles.fold calls its closure, which calls handle.join() for the first thread.
4. handle.join waits for the first thread to finish.
5. handles.fold tries to access the second element of the iterator.
6. The second thread is started.
7. And so on.
You should collect the handles into a vector before folding the result:
```rust
let handles: Vec<_> = (0..THREAD_COUNT)
    .map(|thread_id| {
        thread::spawn(move ||
            // calculate the sum of all numbers assigned to this thread
            (thread_id * BATCH_SIZE + 1 .. (thread_id + 1) * BATCH_SIZE + 1)
                .fold(0_u64, |sum, number| sum + number))
    })
    .collect();
```
Or you could use a crate like Rayon which provides parallel iterators.
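For comparison, a sketch of the same sum using rayon's parallel iterators (rayon crate assumed); rayon distributes the range across its thread pool as soon as `sum` drives the iterator:

```rust
use rayon::prelude::*;

fn main() {
    const THREAD_COUNT: u64 = 8;
    const BATCH_SIZE: u64 = 20000000;

    // Rayon splits the range across its thread pool and sums the pieces.
    let sum: u64 = (1..THREAD_COUNT * BATCH_SIZE + 1).into_par_iter().sum();
    println!("{}", sum);
}
```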
Consider this program:
```rust
use std::io::BufRead;
use std::io;

fn main() {
    let mut n = 0;
    let stdin = io::stdin();
    for _ in stdin.lock().lines() {
        n += 1;
    }
    println!("{}", n);
}
```
Why is it over 10x as slow as the GNU version of wc? Take a look at how I measure it:
```text
$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 1.16586 s, 439 MB/s

$ yes | dd count=1000000 | ./target/release/wc
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 41.685 s, 12.3 MB/s
256000000
```
There are many reasons why your code is way slower than the original wc. A few of the things you pay for you don't actually need at all; by removing those, you can already get a quite substantial speed boost.
Heap allocations
BufRead::lines() returns an iterator which yields String elements. Due to this design, it will (it has to!) allocate memory for every single line. The lines() method is convenient for writing code quickly, but it isn't meant for high-performance situations.
To avoid allocating heap memory for every single line, you can use BufRead::read_line() instead. The code is a bit more verbose, but as you can see, we are reusing the heap memory of s:
```rust
let mut n = 0;
let mut s = String::new();
let stdin = io::stdin();
let mut lock = stdin.lock();

loop {
    s.clear();
    let res = lock.read_line(&mut s);
    if res.is_err() || res.unwrap() == 0 {
        break;
    }
    n += 1;
}

println!("{}", n);
```
On my notebook, this results in:
```text
$ yes | dd count=1000000 | wc -l
256000000
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 0,981827 s, 521 MB/s

$ yes | dd count=1000000 | ./wc
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 6,87622 s, 74,5 MB/s
256000000
```
As you can see, it improved things a lot, but is still not equivalent.
UTF-8 validation
Since we are reading into a String, we are validating the raw input from stdin to be proper UTF-8. This costs time! But we are only interested in the raw bytes, since we only need to count the newline characters (0xA). We can get rid of UTF-8 checks by using a Vec<u8> and BufRead::read_until():
```rust
let mut n = 0;
let mut v = Vec::new();
let stdin = io::stdin();
let mut lock = stdin.lock();

loop {
    v.clear();
    let res = lock.read_until(0xA, &mut v);
    if res.is_err() || res.unwrap() == 0 {
        break;
    }
    n += 1;
}

println!("{}", n);
```
This results in:
```text
1000000+0 records in
1000000+0 records out
512000000 bytes (512 MB, 488 MiB) copied, 4,24162 s, 121 MB/s
256000000
```
That's a 60% improvement. But the original wc is still faster by a factor of 3.5x!
Further possible improvements
Now we've used up all the low-hanging fruit for boosting performance. To match wc's speed, one would have to do some serious profiling, I think. In our current solution, perf reports the following:
- around 11% of the time is spent in memchr; I don't think this can be improved
- around 18% is spent in <StdinLock as std::io::BufRead>::fill_buf()
- around 6% is spent in <StdinLock as std::io::BufRead>::consume()
A huge part of the remaining time is spent in main directly (due to inlining). From the looks of it, we are also paying a bit for the cross-platform abstraction: there is some time spent in Mutex methods and the like.
But at this point, I'm just guessing, because I don't have the time to look into this further. Sorry :<
Please note that wc is an old tool that is highly optimized for the platform it runs on and the task it performs. I guess knowledge of Linux internals would help performance a lot here. This is really specialized, so I wouldn't expect to match that performance easily.
This is because your version is in no way equivalent to the GNU one, which doesn't allocate any memory for strings but only moves the file pointer and increments various counters. In addition, it processes raw bytes, while Rust's String must be valid UTF-8.
GNU wc source
Here's a version I got from Arnavion on #rust-beginners IRC:
```rust
use std::io::Read;

fn main() {
    let mut buffer = [0u8; 1024];
    let stdin = ::std::io::stdin();
    let mut stdin = stdin.lock();

    let mut wc = 0usize;
    loop {
        match stdin.read(&mut buffer) {
            Ok(0) => {
                break;
            },
            Ok(len) => {
                // Count the newline bytes in the chunk we just read.
                wc += buffer[0..len].into_iter().filter(|&&b| b == b'\n').count();
            },
            Err(err) => {
                panic!("{}", err);
            },
        }
    }

    println!("{}", wc);
}
```
This gets performance very close to what the original wc does.