Prefetching and yielding to "hide" cache misses in Rust

When C++20 got stackless coroutines, some papers successfully used them to "hide" cache misses via prefetching and switching to another coroutine. As far as I can tell, Rust's async is also built on stackless coroutines, in that it is a "zero-cost abstraction". Is there work similar to the papers I mentioned on implementing such techniques in Rust? If not, is there anything fundamentally preventing one from doing something like that with async/await?
Edit: I wanted to give a very high-level, simplified summary of what I understand the papers to propose.
We want to run a bunch of independent processes that look like the following:
P1 = load1;proc1; load1';proc1'; load1'';proc1''; ...
P2 = load2;proc2; load2';proc2'; load2'';proc2''; ...
...
PN = loadN;procN; loadN';procN'; loadN'';procN''; ...
where all the loadI terms are likely to cause a cache miss. The authors leverage coroutines to (dynamically) interleave the processes so that the executed code looks like the following:
P =
prefetch1;prefetch2;...;prefetchN;
load1;proc1;prefetch1'; # yield to the scheduler
load2;proc2;prefetch2'; # yield to the scheduler
...
loadN;procN;prefetchN'; # yield to the scheduler
load1';proc1';prefetch1''; # yield to the scheduler
load2';proc2';prefetch2''; # yield to the scheduler
...
loadN';procN';prefetchN''; # yield to the scheduler
...
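As a rough sketch (not from the papers) of how the "# yield to the scheduler" steps could map onto async/await: a hand-rolled future that returns Pending exactly once hands control back to the executor, which can poll the next coroutine while the prefetched line arrives.
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Resolves on the second poll; awaiting it yields control back to the
/// executor exactly once.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

// Each process P_i then becomes a loop of the form:
//     prefetch(next address);   // e.g. core::intrinsics::prefetch_read_data
//     YieldNow(false).await;    // "# yield to the scheduler"
//     load + proc on the now (hopefully) cached data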

The full code for this post can be found here.
I didn't look super hard, but as far as I'm aware there is no existing work on this topic in Rust, so I decided to do a little bit of it myself. Unlike the papers mentioned above, I took a simple approach to test whether it was possible, by creating a couple of simple linked lists and summing them:
pub enum List<T> {
    Cons(T, Box<List<T>>),
    Nil,
}

impl<T> List<T> {
    pub fn new(iter: impl IntoIterator<Item = T>) -> Self {
        let mut tail = List::Nil;
        for item in iter {
            tail = List::Cons(item, Box::new(tail));
        }
        tail
    }
}
const END: i32 = 1024 * 1024 * 1024 / 16;
fn gen_lists() -> (List<i32>, List<i32>) {
    let range = 1..=END;
    (List::new(range.clone()), List::new(range))
}
Ok, a couple of big, simple linked lists. I ran nine different algorithms in two different benchmarks to see how prefetching affected things. The benchmarks are: summing the lists in an owned fashion, where each list is destroyed during iteration, so deallocation accounts for the bulk of the measured time; and summing them in a borrowed fashion, where deallocation time is not measured. The nine algorithms are really only three different algorithms, each implemented with three different techniques: Iterators, Generators, and async Streams.
The three algorithms are zip, which iterates over both lists in lockstep; chain, which iterates over the lists one after the other; and zip prefetch, which uses the prefetch-and-switch method while zipping the two lists together. The basic iterator looks like this:
pub struct ListIter<T>(List<T>);

impl<T> Iterator for ListIter<T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        let mut temp = List::Nil;
        std::mem::swap(&mut temp, &mut self.0);
        match temp {
            List::Cons(t, next) => {
                self.0 = *next;
                Some(t)
            }
            List::Nil => None,
        }
    }
}
and the version with prefetching looks like this:
pub struct ListIterPrefetch<T>(List<T>);

impl<T> Iterator for ListIterPrefetch<T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        let mut temp = List::Nil;
        std::mem::swap(&mut temp, &mut self.0);
        match temp {
            List::Cons(t, next) => {
                self.0 = *next;
                if let List::Cons(_, next) = &self.0 {
                    // `prefetch_read_data` is the unstable `std::intrinsics`
                    // prefetch; locality 3 requests the line in all cache levels.
                    unsafe { prefetch_read_data::<List<T>>(&**next, 3) }
                }
                Some(t)
            }
            List::Nil => None,
        }
    }
}
There are also implementations for Generators and Streams, as well as versions that operate over references, but they all look pretty much the same, so I am omitting them for brevity (a rough sketch of the Stream variant appears after the usage example below). The test harness is pretty simple: it just takes in name-function pairs and times them:
use std::time::Instant;

type BenchFn<T> = fn(List<T>, List<T>) -> T;

fn bench(funcs: &[(&str, BenchFn<i32>)]) {
    for (s, f) in funcs {
        let (l, r) = gen_lists();
        let now = Instant::now();
        println!("bench: {s} result: {} time: {:?}", f(l, r), now.elapsed());
    }
}
For example, usage with the basic iterator tests:
bench(&[
    ("iter::zip", |l, r| {
        l.into_iter().zip(r).fold(0, |a, (l, r)| a + l + r)
    }),
    ("iter::zip prefetch", |l, r| {
        l.into_iter_prefetch()
            .zip(r.into_iter_prefetch())
            .fold(0, |a, (l, r)| a + l + r)
    }),
    ("iter::chain", |l, r| l.into_iter().chain(r).sum()),
]);
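As a rough idea of the omitted Stream variant (a sketch for illustration, not the code that produced the numbers below), a prefetching Stream could be built with futures::stream::unfold; zipping two such streams alternates between them, so each prefetch gets time to land while the other list advances. It assumes the same unstable prefetch_read_data intrinsic used above.
use futures::stream::{self, Stream};

fn stream_prefetch<T>(list: List<T>) -> impl Stream<Item = T> {
    stream::unfold(list, |state| async move {
        match state {
            List::Cons(t, next) => {
                let next = *next;
                // Prefetch the node after the one we just unpacked.
                if let List::Cons(_, nn) = &next {
                    unsafe { prefetch_read_data::<List<T>>(&**nn, 3) }
                }
                Some((t, next))
            }
            List::Nil => None,
        }
    })
}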
The results of the test on my computer, which has an Intel(R) Core(TM) i5-8365U CPU and 24 GB of RAM:
Bench owned
bench: iter::zip result: 67108864 time: 11.1873901s
bench: iter::zip prefetch result: 67108864 time: 19.3889487s
bench: iter::chain result: 67108864 time: 8.4363853s
bench: generator zip result: 67108864 time: 16.7242197s
bench: generator chain result: 67108864 time: 8.9897788s
bench: generator prefetch result: 67108864 time: 11.7599589s
bench: stream::zip result: 67108864 time: 14.339864s
bench: stream::chain result: 67108864 time: 7.7592133s
bench: stream::zip prefetch result: 67108864 time: 11.1455706s
Bench ref
bench: iter::zip result: 67108864 time: 1.1343996s
bench: iter::zip prefetch result: 67108864 time: 864.4865ms
bench: iter::chain result: 67108864 time: 1.4036277s
bench: generator zip result: 67108864 time: 1.1360857s
bench: generator chain result: 67108864 time: 1.740029s
bench: generator prefetch result: 67108864 time: 904.1086ms
bench: stream::zip result: 67108864 time: 1.0902568s
bench: stream::chain result: 67108864 time: 1.5683112s
bench: stream::zip prefetch result: 67108864 time: 1.2031745s
The result is the computed sum, and the time is the measured duration. Looking at the destructive (owned) summation benchmark, a few things stand out:
The chain algorithm works best. My guess is that this is because it improves cache locality for the allocator, which is where the vast majority of the time is spent.
Prefetching greatly improves the times of the Stream and Generator versions, bringing them on par with the standard iterator.
Prefetching completely ruins the iterator strategy. This is why you always benchmark when doing these things. I also would not be surprised if this is a failure of the compiler to optimize properly, rather than the prefetching directly hurting performance.
Looking at the borrowed summation, a few things stand out:
Without deallocation being measured, the recorded times are much shorter. This is how I know deallocation dominates the benchmark above.
The chain method loses; apparently running in lockstep is the way to go here.
Prefetching is the way to go for iterators and generators. In contrast to the previous benchmark, prefetching makes the iterator the fastest strategy rather than the slowest.
Prefetching causes a slowdown when using Streams, although the Streams performed poorly overall.
This testing wasn't the most scientific, for a variety of reasons, and it isn't a particularly realistic workload, but given the results I can confidently say that a prefetch-and-switch strategy can result in solid performance improvements if done right. I also omitted a fair bit of the testing code for brevity; the full code can be found here.

Related

Computing u32 hash with FxHasher fast

I've recently been experimenting with different hash functions in Rust. I started off with the fasthash crate, where many algorithms are implemented; e.g., murmur3 is called as
let hval = murmur3::hash32_with_seed(&tmp_read_buff, 123 as u32);
This works very fast (e.g., a few seconds for 100,000,000 short inputs). I also stumbled upon FxHash, the algorithm used a lot internally in Firefox (at least initially?). I rolled my own version of hashing a byte array with this algorithm as follows:
use rustc_hash::FxHasher;
use std::hash::Hasher;

fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    for el in read_buff {
        hasher.write_u8(*el);
    }
    hasher.finish()
}
This works; however, it's about 5x slower. I'd really like to know if I'm missing something apparent here and how I could achieve speeds similar to or better than fasthash's murmur3. My intuition is that with FxHash the core operation is very simple,
self.hash = self.hash.rotate_left(5).bitxor(i).wrapping_mul(K);
hence it should be one of the fastest.
The FxHasher documentation mentions that:
the speed of the hash function itself is much higher because it works on up to 8 bytes at a time.
But your algorithm completely removes this possibility because you are processing each byte individually. You can make it a lot faster by hashing in chunks of 8 bytes.
fn hash_with_fx(read_buff: &[u8]) -> u64 {
    let mut hasher = FxHasher::default();
    let mut chunks = read_buff.chunks_exact(8);
    for bytes in &mut chunks {
        // unwrap is ok because `chunks_exact` provides the guarantee that
        // the `bytes` slice is exactly the requested length
        let int = u64::from_be_bytes(bytes.try_into().unwrap());
        hasher.write_u64(int);
    }
    for byte in chunks.remainder() {
        hasher.write_u8(*byte);
    }
    hasher.finish()
}
For very small inputs (especially lengths like 7 bytes) it may introduce a small extra overhead compared with your original code, but for larger inputs it ought to be significantly faster.
Bonus material
It should be possible to remove a few extra instructions in the loop by using unwrap_unchecked instead of unwrap. It's completely sound to do so in this case, but it may not be worth introducing unsafe code into your codebase. I would measure the difference before deciding to include unsafe code.
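Concretely, only the chunk loop would change; a sketch of the unwrap_unchecked variant (measure before adopting it):
for bytes in &mut chunks {
    // SAFETY: `chunks_exact(8)` guarantees `bytes.len() == 8`, so the
    // conversion to `[u8; 8]` cannot fail.
    let arr: [u8; 8] = unsafe { bytes.try_into().unwrap_unchecked() };
    hasher.write_u64(u64::from_be_bytes(arr));
}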

Why is running cargo bench faster than running release build?

I want to benchmark my Rust programs and was comparing some alternatives for doing so. I noticed, however, that when running a benchmark with cargo bench and the bencher crate, the code runs consistently faster than when running a release build (cargo build --release) of the same code. For example:
Main code:
use std::time;

use dot_product;

const N: usize = 1000000;

fn main() {
    let start = time::Instant::now();
    dot_product::rayon_parallel([1; N].to_vec(), [2; N].to_vec());
    println!("Time: {:?}", start.elapsed());
}
Average time: ~20ms
Benchmark code:
#[macro_use]
extern crate bencher;

use bencher::Bencher;
use dot_product;

const N: usize = 1000000;

fn parallel(bench: &mut Bencher) {
    bench.iter(|| dot_product::rayon_parallel([1; N].to_vec(), [2; N].to_vec()))
}

// `sequential` is another benchmark function from the same file, not shown here.
benchmark_group!(benches, sequential, parallel);
benchmark_main!(benches);
Time: 5,006,199 ns/iter (+/- 1,320,975)
I tried the same with some other programs and cargo bench gives consistently faster results. Why could this happen?
As the comments suggested, you should use criterion::black_box() on all (final) results in the benchmarking code. This function does nothing, simply giving back its only parameter, but it is opaque to the optimizer, so the compiler has to assume that something is done with the input.
When not using black_box(), the benchmarked code doesn't actually do anything: the compiler can see that the results of your code are unused and that no side effects can be observed, so it removes your code during dead-code elimination, and what you end up benchmarking is the benchmarking suite itself.
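For example, the parallel benchmark above could wrap its inputs and result like this (a sketch; std::hint::black_box has been stable since Rust 1.66, and criterion::black_box works the same way):
use std::hint::black_box;

fn parallel(bench: &mut Bencher) {
    bench.iter(|| {
        // Wrapping both the inputs and the result keeps the optimizer from
        // proving the computation unused and deleting it.
        black_box(dot_product::rayon_parallel(
            black_box([1; N].to_vec()),
            black_box([2; N].to_vec()),
        ))
    })
}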

Is `iter().map().sum()` as fast as `iter().fold()`?

Does the compiler generate the same code for iter().map().sum() and iter().fold()? In the end they achieve the same goal, but the first code would iterate two times, once for the map and once for the sum.
Here is an example. Which version would be faster in total?
pub fn square(s: u32) -> u64 {
    match s {
        s @ 1..=64 => 2u64.pow(s - 1),
        _ => panic!("Square must be between 1 and 64"),
    }
}
pub fn total() -> u64 {
    // A fold
    (0..64).fold(0u64, |r, s| r + square(s + 1))

    // or a map
    (1..64).map(square).sum()
}
What would be good tools to look at the assembly or benchmark this?
For them to generate the same code, they'd first have to do the same thing. Your two examples do not:
fn total_fold() -> u64 {
    (0..64).fold(0u64, |r, s| r + square(s + 1))
}

fn total_map() -> u64 {
    (1..64).map(square).sum()
}

fn main() {
    println!("{}", total_fold());
    println!("{}", total_map());
}

18446744073709551615
9223372036854775807
Let's assume you meant
fn total_fold() -> u64 {
    (1..64).fold(0u64, |r, s| r + square(s + 1))
}

fn total_map() -> u64 {
    (1..64).map(|i| square(i + 1)).sum()
}
There are a few avenues to check:
The generated LLVM IR
The generated assembly
Benchmark
The easiest source for the IR and assembly is one of the playgrounds (official or alternate). These both have buttons to view the assembly or IR. You can also pass --emit=llvm-ir or --emit=asm to the compiler to generate these files.
Make sure to generate assembly or IR in release mode. The attribute #[inline(never)] is often useful to keep functions separate to find them easier in the output.
Benchmarking is documented in The Rust Programming Language, so there's no need to repeat all that valuable information.
Before Rust 1.14, these did not produce the exact same assembly. I'd wait for benchmarking / profiling data to see if there's any meaningful impact on performance before worrying about it.
As of Rust 1.14, they do produce the same assembly! This is one reason I love Rust. You can write clear and idiomatic code and smart people come along and make it just as fast.
but the first code would iterate two times, once for the map and once for the sum.
This is incorrect, and I'd love to know what source told you this so we can correct it and prevent future misunderstandings. An iterator operates on a pull basis: one element is processed at a time. The core method is next, which yields a single value, running just enough computation to produce that value.
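A quick way to see the single pass (a toy example for illustration): the closure passed to map only runs when next is called, and sum is just a loop over next.
fn main() {
    let mut iter = (1u64..4).map(|x| {
        println!("mapping {x}");
        x * x
    });

    // Nothing has been printed yet: `map` is lazy.
    assert_eq!(iter.next(), Some(1)); // prints "mapping 1"
    assert_eq!(iter.next(), Some(4)); // prints "mapping 2"

    // `sum` keeps calling `next` in the same way until it sees `None`,
    // so `map(...).sum()` is one pass over the data, not two.
    let rest: u64 = iter.sum(); // prints "mapping 3"
    assert_eq!(rest, 9);
}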
First, let's fix those examples to actually return the same result:
pub fn total_fold_iter() -> u64 {
    (1..65).fold(0u64, |r, s| r + square(s))
}

pub fn total_map_iter() -> u64 {
    (1..65).map(square).sum()
}
Now, let's unpack them, starting with fold. A fold is just a loop and an accumulator; it is roughly equivalent to:
pub fn total_fold_explicit() -> u64 {
    let mut total = 0;
    for i in 1..65 {
        total = total + square(i);
    }
    total
}
Then, let's take map and sum, unwrapping the sum first, which is roughly equivalent to:
pub fn total_map_partial_iter() -> u64 {
    let mut total = 0;
    for i in (1..65).map(square) {
        total += i;
    }
    total
}
It's just a simple accumulator! And now, let's unwrap the map layer (which only applies a function), obtaining something that is roughly equivalent to:
pub fn total_map_explicit() -> u64 {
    let mut total = 0;
    for i in 1..65 {
        let s = square(i);
        total += s;
    }
    total
}
As you can see, both of them are extremely similar: they apply the same operations in the same order and have the same overall complexity.
Which is faster? I have no idea. And a micro-benchmark may only tell half the truth anyway: just because something is faster in a micro-benchmark does not mean it is faster in the midst of other code.
What I can say, however, is that they both have equivalent complexity and therefore should behave similarly, ie within a factor of each other.
And that I would personally go for map + sum, because it expresses the intent more clearly whereas fold is the "kitchen-sink" of Iterator methods and therefore far less informative.

Do we need to manually create a destructor for a linked list?

I'm reading Learning Rust With Entirely Too Many Linked Lists and I'm confused about why the linked list (a stack) needs a destructor.
I think that when the list value goes out of scope, the list itself and all of its nodes would be cleaned up. Is it just for demonstration?
I benchmarked the versions with and without a manual destructor, and I found that the one without a manual destructor performs better:
for _ in 1..30000000 {
    let mut list = List::new();
    list.push(1);
    assert_eq!(list.pop(), Some(1));
}
With manual destructor:
real 0m11.216s
user 0m11.192s
sys 0m0.020s
Without manual destructor:
real 0m9.071s
user 0m9.044s
sys 0m0.004s
You are correct. The list would clean itself up. As the author stated:
All that's handled for us automatically... with one hitch.
They then explain why the automatic handling is bad:
The automatic destruction process calls drop for the head of the list, which in turn calls drop for the first element. And so on and so on.
This is a function calling a function calling a function (with infinite possible repetitions) which will blow up your stack sooner or later.
This test causes such a stack overflow:
#[test]
fn build_really_big_stack() {
    let mut stack = List::new();
    for i in 0..1_000_000 {
        stack.push(i);
    }
}
If you build with --release for both versions, it shows that they perform nearly equally:
#[bench]
fn bench_auto_destructor(b: &mut Bencher) {
    b.iter(|| {
        let mut list = List::new();
        for i in 0..1000 {
            list.push(i);
        }
        assert_eq!(list.pop(), Some(999));
    });
}

#[bench]
fn bench_man_destructor(b: &mut Bencher) {
    b.iter(|| {
        let mut list = ManualDestructorList::new();
        for i in 0..1000 {
            list.push(i);
        }
        assert_eq!(list.pop(), Some(999));
    });
}
test bench_auto_destructor ... bench: 81,296 ns/iter (+/- 302)
test bench_man_destructor ... bench: 85,756 ns/iter (+/- 164)
With only one element, like in your benchmarks:
test bench_auto_destructor ... bench: 69 ns/iter (+/- 1)
test bench_man_destructor ... bench: 67 ns/iter (+/- 2)
Read the article to the end; its explanation is better than mine.
The reason the author has you implement your own drop for Link is that the linked list's destructor is not tail recursive, so if a very large List (i.e. a List with more nodes than the call stack can hold frames for) goes out of scope and is deallocated, you'll get a stack overflow when all those drop functions are called recursively. Go read the link I gave above to understand what tail recursion is, but replace the recsum() function with Link's drop function, and you'll understand why the author made you write your own destructor.
Imagine a List with 1_000_000 Nodes. When that List gets deallocated, your stack will look like this:
(Stack Frame #1) List.drop(); // all work is done in this frame, so no need to save stack frame
(Stack Frame #2) self.ptr.drop(); self.ptr.deallocate(); // uh oh! stack frame must be preserved throughout all recursive calls because deallocate must be called after self.ptr.drop() returns
(Stack Frame #3) self.ptr.drop(); self.ptr.deallocate();
...
(Stack Frame #1_000_000) self.ptr.drop(); self.ptr.deallocate(); // we'll never reach here, because a stack overflow would have occurred long ago
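For reference, the fix the book leads up to is an iterative Drop that detaches nodes in a loop instead of letting drops nest recursively. A minimal sketch, using a simplified Option<Box<Node>> layout rather than the book's exact Link type:
pub struct List {
    head: Option<Box<Node>>,
}

struct Node {
    elem: i32,
    next: Option<Box<Node>>,
}

impl Drop for List {
    fn drop(&mut self) {
        // Walk the list iteratively, detaching one node per iteration.
        let mut cur = self.head.take();
        while let Some(mut node) = cur {
            cur = node.next.take();
            // `node` is dropped here with `next` already set to `None`,
            // so its drop does not recurse into the rest of the list.
        }
    }
}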

Chaining checked arithmetic operations in Rust

When doing integer arithmetic with overflow checks, calculations often need to compose several arithmetic operations. A straightforward way of chaining checked arithmetic in Rust uses the checked_* methods and Option chaining:
fn calculate_size(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    elem_size
        .checked_mul(length)
        .and_then(|acc| acc.checked_add(offset))
}
However, this tells the compiler to generate a branch per elementary operation. I have encountered a more unrolled approach using the overflowing_* methods:
fn calculate_size(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}
Continuing the computation regardless of overflows and aggregating the overflow flags with a bitwise OR ensures that at most one branch is taken in the entire evaluation (provided that the implementations of overflowing_* generate branchless code). This optimization-friendly approach is more cumbersome and requires some caution in dealing with intermediate values.
Does anyone have experience with how the Rust compiler optimizes either of the patterns above on various CPU architectures, to tell whether the explicit unrolling is worthwhile, especially for more complex expressions?
Does anyone have experience with how the Rust compiler optimizes either of the patterns above on various CPU architectures, to tell whether the explicit unrolling is worthwhile, especially for more complex expressions?
You can use the playground to check how LLVM optimizes things: just click on "LLVM IR" or "ASM" instead of "Run". Stick a #[inline(never)] on the function you wish to check, and make sure to pass it run-time arguments to avoid constant folding, as in here:
use std::env;

#[inline(never)]
fn calculate_size(elem_size: usize, length: usize, offset: usize) -> Option<usize> {
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    if oflo1 | oflo2 {
        None
    } else {
        Some(acc)
    }
}

fn main() {
    // Skip the program name so only the three numeric arguments are parsed.
    let vec: Vec<usize> = env::args().skip(1).map(|s| s.parse().unwrap()).collect();
    let result = calculate_size(vec[0], vec[1], vec[2]);
    println!("{:?}", result);
}
The answer you'll get, however, is that the overflow intrinsics in Rust and LLVM have been coded for convenience and not performance, unfortunately. This means that while the explicit unrolling optimizes well, counting on LLVM to optimize the checked code is not realistic for now.
Normally this is not an issue; but for a performance hotspot, you may want to unroll manually.
Note: this lack of performance is also the reason that overflow checking is disabled by default in Release mode.
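For longer chains, the manually unrolled pattern scales by simply collecting more flags. A sketch with a hypothetical extra padding term (not part of the original question):
fn calculate_padded_size(
    elem_size: usize,
    length: usize,
    offset: usize,
    padding: usize,
) -> Option<usize> {
    // elem_size * length + offset + padding, with a single branch at the end.
    let (acc, oflo1) = elem_size.overflowing_mul(length);
    let (acc, oflo2) = acc.overflowing_add(offset);
    let (acc, oflo3) = acc.overflowing_add(padding);
    if oflo1 | oflo2 | oflo3 {
        None
    } else {
        Some(acc)
    }
}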

Resources