Why is running cargo bench faster than running release build?

I want to benchmark my Rust programs and was comparing some alternatives for doing that. I noticed, however, that when running a benchmark with cargo bench and the bencher crate, the code runs consistently faster than when running a production build (cargo build --release) of the same code. For example:
Main code:
use dot_product;
const N: usize = 1000000;
use std::time;

fn main() {
    let start = time::Instant::now();
    dot_product::rayon_parallel([1; N].to_vec(), [2; N].to_vec());
    println!("Time: {:?}", start.elapsed());
}
Average time: ~20ms
Benchmark code:
#[macro_use]
extern crate bencher;

use dot_product;
use bencher::Bencher;

const N: usize = 1000000;

fn parallel(bench: &mut Bencher) {
    bench.iter(|| dot_product::rayon_parallel([1; N].to_vec(), [2; N].to_vec()))
}

benchmark_group!(benches, sequential, parallel);
benchmark_main!(benches);
Time: 5,006,199 ns/iter (+/- 1,320,975)
I tried the same with some other programs and cargo bench gives consistently faster results. Why could this happen?

As the comments suggested, you should apply black_box() to all (final) results in the benchmarking code (criterion::black_box(); the bencher crate re-exports an equivalent black_box()). This function does nothing (it simply returns its only parameter), but it is opaque to the optimizer, so the compiler has to assume the function does something with the input.
When you don't use black_box(), the benchmarked code doesn't actually do anything, because the compiler can see that the results of your code are unused and no side effects can be observed. So it removes all your code during dead-code elimination, and what you end up benchmarking is the benchmarking suite itself.
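As a rough sketch (assuming the same dot_product crate and bencher setup as in the question), the fix looks like this:

#[macro_use]
extern crate bencher;

use bencher::{black_box, Bencher};
use dot_product;

const N: usize = 1000000;

fn parallel(bench: &mut Bencher) {
    bench.iter(|| {
        // black_box() forces the compiler to treat the result as used,
        // so the dot product cannot be removed as dead code.
        black_box(dot_product::rayon_parallel([1; N].to_vec(), [2; N].to_vec()))
    })
}

benchmark_group!(benches, parallel);
benchmark_main!(benches);

With the result kept alive this way, the cargo bench numbers should be comparable to the release-build timing.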

Related

Prefetching and yielding to "hide" cache misses in Rust

When C++20 got stackless coroutines, some papers successfully used them to "hide" cache misses via prefetching and switching to another coroutine. As far as I can tell, Rust's async is also like stackless coroutines in that it is a "zero cost abstraction". Is there work similar to the papers I mentioned for implementing such techniques in Rust? If not, is there anything fundamentally preventing one from doing something like that with async/await?
Edit: I wanted to give a very high-level and simplified summary of what I understand the papers to propose:
We want to run a bunch of independent processes that look like the following:
P1 = load1;proc1; load1';proc1'; load1'';proc1''; ...
P2 = load2;proc2; load2';proc2'; load2'';proc2''; ...
...
PN = loadN;procN; loadN';procN'; loadN'';procN''; ...
where all the loadI terms are likely to cause a cache miss. The authors leverage coroutines to (dynamically) interleave the processes so that the code executed looks like the following:
P =
prefetch1;prefetch2;...;prefetchN;
load1;proc1;prefetch1'; # yield to the scheduler
load2;proc2;prefetch2'; # yield to the scheduler
...
loadN;procN;prefetchN'; # yield to the scheduler
load1';proc1';prefetch1''; # yield to the scheduler
load2';proc2';prefetch2''; # yield to the scheduler
...
loadN';procN';prefetchN''; # yield to the scheduler
...
The full code for this post can be found here.
I didn't look super hard, but as far as I'm aware there is no existing work on this in Rust, so I decided to do a little bit of it myself. Unlike the mentioned papers, I took a simple approach to test whether it was possible, by creating a couple of simple linked lists and summing them:
pub enum List<T> {
    Cons(T, Box<List<T>>),
    Nil,
}

impl<T> List<T> {
    pub fn new(iter: impl IntoIterator<Item = T>) -> Self {
        let mut tail = List::Nil;
        for item in iter {
            tail = List::Cons(item, Box::new(tail));
        }
        tail
    }
}
const END: i32 = 1024 * 1024 * 1024 / 16;

fn gen_lists() -> (List<i32>, List<i32>) {
    let range = 1..=END;
    (List::new(range.clone()), List::new(range))
}
OK, a couple of big, simple linked lists. I ran nine different algorithms in two different benchmarks to see how the prefetching affected things. The benchmarks are summing the lists in an owned fashion, where the list is destroyed during iteration and deallocation makes up the bulk of the measured time, and summing them in a borrowed fashion, where the deallocation time is not measured. The nine variants are really only three different algorithms implemented with three different techniques: Iterators, Generators, and async Streams.
The three algorithms are zip, which is iterating over both lists in lockstep, chain, which is iterating over the lists one after the other, and zip prefetch, where the prefetch and switch method is used while zipping the two lists together. The basic iterator looks like this:
pub struct ListIter<T>(List<T>);

impl<T> Iterator for ListIter<T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        let mut temp = List::Nil;
        std::mem::swap(&mut temp, &mut self.0);
        match temp {
            List::Cons(t, next) => {
                self.0 = *next;
                Some(t)
            }
            List::Nil => None,
        }
    }
}
and the version with prefetching looks like this:
pub struct ListIterPrefetch<T>(List<T>);

impl<T> Iterator for ListIterPrefetch<T> {
    type Item = T;

    fn next(&mut self) -> Option<T> {
        let mut temp = List::Nil;
        std::mem::swap(&mut temp, &mut self.0);
        match temp {
            List::Cons(t, next) => {
                self.0 = *next;
                // Prefetch the node after the one we just advanced to.
                // prefetch_read_data comes from std::intrinsics (nightly only);
                // locality 3 requests the highest temporal locality.
                if let List::Cons(_, next) = &self.0 {
                    unsafe { prefetch_read_data::<List<T>>(&**next, 3) }
                }
                Some(t)
            }
            List::Nil => None,
        }
    }
}
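The prefetch_read_data intrinsic used above requires a nightly compiler. As a side note (not from the original post), on stable Rust a similar hint is available through the x86 _mm_prefetch intrinsic; a minimal, x86_64-only sketch:

#[cfg(target_arch = "x86_64")]
fn prefetch<T>(ptr: *const T) {
    use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
    // _mm_prefetch never dereferences the pointer; it only asks the CPU
    // to pull the cache line containing `ptr` into the cache.
    unsafe { _mm_prefetch::<_MM_HINT_T0>(ptr as *const i8) };
}

Such a helper could stand in for the unsafe prefetch_read_data call in ListIterPrefetch::next on a stable toolchain.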
There are also different implementations for Generators and Streams, as well as a version that operates over references, but they all look pretty much the same, so I am omitting them for brevity. The test harness is pretty simple: it just takes a name-function pair and times it:
use std::time::Instant;

type BenchFn<T> = fn(List<T>, List<T>) -> T;

fn bench(funcs: &[(&str, BenchFn<i32>)]) {
    for (s, f) in funcs {
        let (l, r) = gen_lists();
        let now = Instant::now();
        println!("bench: {s} result: {} time: {:?}", f(l, r), now.elapsed());
    }
}
For example, usage with the basic iterator tests:
bench(&[
    ("iter::zip", |l, r| {
        l.into_iter().zip(r).fold(0, |a, (l, r)| a + l + r)
    }),
    ("iter::zip prefetch", |l, r| {
        l.into_iter_prefetch()
            .zip(r.into_iter_prefetch())
            .fold(0, |a, (l, r)| a + l + r)
    }),
    ("iter::chain", |l, r| l.into_iter().chain(r).sum()),
]);
The results of the test on my computer, which has an Intel(R) Core(TM) i5-8365U CPU and 24 GB of RAM:
Bench owned
bench: iter::zip result: 67108864 time: 11.1873901s
bench: iter::zip prefetch result: 67108864 time: 19.3889487s
bench: iter::chain result: 67108864 time: 8.4363853s
bench: generator zip result: 67108864 time: 16.7242197s
bench: generator chain result: 67108864 time: 8.9897788s
bench: generator prefetch result: 67108864 time: 11.7599589s
bench: stream::zip result: 67108864 time: 14.339864s
bench: stream::chain result: 67108864 time: 7.7592133s
bench: stream::zip prefetch result: 67108864 time: 11.1455706s
Bench ref
bench: iter::zip result: 67108864 time: 1.1343996s
bench: iter::zip prefetch result: 67108864 time: 864.4865ms
bench: iter::chain result: 67108864 time: 1.4036277s
bench: generator zip result: 67108864 time: 1.1360857s
bench: generator chain result: 67108864 time: 1.740029s
bench: generator prefetch result: 67108864 time: 904.1086ms
bench: stream::zip result: 67108864 time: 1.0902568s
bench: stream::chain result: 67108864 time: 1.5683112s
bench: stream::zip prefetch result: 67108864 time: 1.2031745s
The result is the sum calculated, and time was the recorded time. When looking at the destructive summation benchmarks, a few things stand out:
The chain algorithm works the best. My guess is that this is because it improves cache locality for the allocator, which is where the vast majority of the time is spent.
Prefetching greatly improves the times of the Stream and generator versions, bringing them on par with the standard iterator.
Prefetching completely ruins the iterator strategy. This is why you always benchmark when doing these things. I also would not be surprised if this is a failure of the compiler to optimize properly, rather than the prefetching directly hurting performance.
When looking at the borrowed summation, a few things stand out:
Without measuring deallocation times, the recorded times are much shorter. This is how I know deallocation is most of the above benchmark.
The chain method loses; apparently running in lockstep is the way to go.
Prefetching is the way to go for iterators and generators. In contrast to the previous benchmark, prefetching causes the iterator to be the fastest strategy, rather than the slowest.
Prefetching causes a slow down when using streams, although the streams just performed poorly overall.
This testing wasn't the most scientific, for a variety of reasons, and it isn't a particularly realistic workload, but given the results I can confidently say that a prefetch-and-switch strategy can definitely result in solid performance improvements if done right. I also omitted a fair bit of the testing code for brevity; the full code can be found here.

Rust: initialize a static variable/reference in a lib?

I'm new to Rust. I'm trying to create a static variable DATA of Vec<u8> in a library so that it is initialized after the compilation of the lib. I then include the lib in the main code hoping to use DATA directly without calling init_data() again. Here's what I've tried:
my_lib.rs:
use lazy_static::lazy_static;

pub fn init_data() -> Vec<u8> {
    // some expensive calculations
}

lazy_static! {
    // supposed to call init_data() only once, during compilation
    pub static ref DATA: Vec<u8> = init_data();
}
main.rs:
use my_lib::DATA;
call1(&DATA); // use DATA here without calling init_data()
call2(&DATA);
But it turned out that init_data() is still called in the main.rs. What's wrong with this code?
Update: as Ivan C pointed out, lazy_static is not run at compile time. So, what's the right choice for 'pre-loading' the data?
There are two problems here: the choice of type, and performing the allocation.
It is not possible to construct a Vec, a Box, or any other type that requires heap allocation at compile time, because the heap allocator and the heap do not yet exist at that point. Instead, you must use a reference type, which can point to data allocated in the binary rather than in the run-time heap, or an array without any reference (if the data is not too large).
Next, we need a way to perform the computation. Theoretically, the cleanest option is constant evaluation — straightforwardly executing parts of your code at compile time.
static DATA: &'static [u8] = {
    // code goes here
};
However, in current stable Rust versions (1.58.1 as I'm writing this), constant evaluation is very limited, because you cannot do anything that looks like dropping a value, or use any function belonging to a trait. It can still do some things, mostly integer arithmetic or constructing other "almost literal" data; for example:
const N: usize = 10;

static FIRST_N_FIBONACCI: &'static [u32; N] = &{
    let mut array = [0; N];
    array[1] = 1;
    let mut i = 2;
    while i < array.len() {
        array[i] = array[i - 1] + array[i - 2];
        i += 1;
    }
    array
};

fn main() {
    dbg!(FIRST_N_FIBONACCI);
}
If your computation cannot be expressed using const evaluation, then you will need to perform it another way:
Procedural macros are effectively compiler plugins, and they can perform arbitrary computation, but their output is generated Rust syntax. So, a procedural macro could produce an array literal with the precomputed data.
The main limitation of procedural macros is that they must be defined in dedicated crates (so if your project is one library crate, it would now be two instead).
Build scripts are ordinary Rust code which can compile or generate files used by the main compilation. They don't interact with the compiler, but are run by Cargo before compilation starts.
(Unlike const evaluation, neither build scripts nor proc macros can use any of the types or constants defined within the crate being built itself; they can read the source code, but they run too early to use the crate's other items in their own code.)
In your case, because you want to precompute some [u8] data, I think the simplest approach would be to add a build script which writes the data to a file, after which your normal code can embed this data from the file using include_bytes!.
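A minimal sketch of that approach (the file name data.bin and the placeholder computation are made up for illustration; OUT_DIR, env!, and include_bytes! are the standard Cargo/std mechanisms):

// build.rs
use std::{env, fs, path::Path};

fn main() {
    // Stand-in for the expensive init_data() computation.
    let data: Vec<u8> = (0u16..1024).map(|i| (i % 251) as u8).collect();

    let out_dir = env::var("OUT_DIR").unwrap();
    fs::write(Path::new(&out_dir).join("data.bin"), &data).unwrap();
}

// src/lib.rs
pub static DATA: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/data.bin"));

main.rs can then use my_lib::DATA directly; the computation runs in the build script, before the crate itself is compiled.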

Parsing 40MB file noticeably slower than equivalent Pascal code [duplicate]

This question already has an answer here:
Why is my Rust program slower than the equivalent Java program?
(1 answer)
Closed 2 years ago.
use std::fs::File;
use std::io::Read;

fn main() {
    let mut f = File::open("binary_file_path").expect("no file found");
    let mut buf = vec![0u8; 15000 * 707 * 4];
    f.read(&mut buf).expect("Something went berserk");
    let result: Vec<_> = buf
        .chunks(2)
        .map(|chunk| i16::from_le_bytes([chunk[0], chunk[1]]))
        .collect();
}
I want to read a binary file. The last line takes around 15s. I'd expect it to only take a fraction of a second. How can I optimise it?
Your code looks like the compiler should be able to optimise it decently. Make sure that you compile it in release mode using cargo build --release. Converting 40MB of data to native endianness should only take a fraction of a second.
You can simplify the code and save some unnecessary copying by using the byteorder crate. It defines an extension trait for all implementors of Read, which allows you to directly call read_i16_into() on the file object.
use byteorder::{LittleEndian, ReadBytesExt};
use std::fs::File;
let mut f = File::open("binary_file_path").expect("no file found");
let mut result = vec![0i16; 15000 * 707 * 2];
f.read_i16_into::<LittleEndian>(&mut result).unwrap();
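If you'd rather avoid the extra dependency, a rough sketch of the same idea using only the standard library (not from the original answer) is to fill the buffer with read_exact and convert with chunks_exact, again built in release mode:

use std::fs::File;
use std::io::Read;

fn main() -> std::io::Result<()> {
    let mut f = File::open("binary_file_path")?;
    let mut buf = vec![0u8; 15000 * 707 * 4];
    // read_exact() fills the whole buffer; a plain read() may stop early.
    f.read_exact(&mut buf)?;
    let result: Vec<i16> = buf
        .chunks_exact(2)
        .map(|c| i16::from_le_bytes([c[0], c[1]]))
        .collect();
    assert_eq!(result.len(), 15000 * 707 * 2);
    Ok(())
}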
cargo build --release improved the performance

Recursive function calculating factorials leads to stack overflow

I tried a recursive factorial algorithm in Rust. I use this version of the compiler:
rustc 1.12.0 (3191fbae9 2016-09-23)
cargo 0.13.0-nightly (109cb7c 2016-08-19)
Code:
extern crate num_bigint;
extern crate num_traits;

use num_bigint::{BigUint, ToBigUint};
use num_traits::One;

fn factorial(num: u64) -> BigUint {
    let current: BigUint = num.to_biguint().unwrap();
    if num <= 1 {
        return One::one();
    }
    return current * factorial(num - 1);
}

fn main() {
    let num: u64 = 100000;
    println!("Factorial {}! = {}", num, factorial(num))
}
I got this error:
$ cargo run
thread 'main' has overflowed its stack
fatal runtime error: stack overflow
error: Process didn't exit successfully
How to fix that? And why do I see this error when using Rust?
Rust doesn't have tail call elimination, so your recursion is limited by your stack size. It may be a feature for Rust in the future (you can read more about it at the Rust FAQ), but in the meantime you will have to either not recurse so deep or use loops.
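As a rough sketch of the loop-based alternative (using the same num-bigint types as in the question; this is not part of the original answer):

use num_bigint::BigUint;
use num_traits::One;

fn factorial_iter(num: u64) -> BigUint {
    // Multiply 1 * 2 * ... * num without recursion, so stack usage stays constant.
    let mut result: BigUint = One::one();
    for i in 2..=num {
        result *= i;
    }
    result
}

fn main() {
    let num: u64 = 100_000;
    println!("Factorial {}! = {}", num, factorial_iter(num));
}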
Why?
This is a stack overflow which occurs whenever there is no stack memory left. For example, stack memory is used by
local variables
function arguments
return values
Recursion uses a lot of stack memory, because for every recursive call, the memory for all local variables, function arguments, ... has to be allocated on the stack.
How to fix that?
The obvious solution is to write your algorithm in a non-recursive manner (you should do this when you want to use the algorithm in production!). But you can also just increase the stack size. While the stack size of the main thread can't be modified, you can create a new thread and set a specific stack size:
use std::thread;

fn main() {
    let num: u64 = 100_000;
    // Size of one stack frame for `factorial()` was measured experimentally
    thread::Builder::new()
        .stack_size(num as usize * 0xFF)
        .spawn(move || {
            println!("Factorial {}! = {}", num, factorial(num));
        })
        .unwrap()
        .join()
        .unwrap();
}
This code works and, when executed via cargo run --release (with optimization!), outputs the solution after only a couple of seconds calculation.
Measuring stack frame size
In case you want to know how the stack frame size (memory requirement for one call) for factorial() was measured: I printed the address of the function argument num on each factorial() call:
fn factorial(num: u64) -> BigUint {
    println!("{:p}", &num);
    // ...
}
The difference between two successive calls' addresses is (more or less) the stack frame size. On my machine, the difference was slightly less than 0xFF (255), so I just used that as the size.
In case you're wondering why the stack frame size isn't smaller: the Rust compiler doesn't really optimize for this metric. Usually it's really not important, so optimizers tend to sacrifice this memory requirement for better execution speed. I took a look at the assembly and in this case many BigUint methods were inlined. This means that the local variables of other functions are using stack space as well!
Just as an alternative (which I do not recommend):
Matt's answer is true to an extent. There is a crate called stacker (here) that can artificially increase the stack size for use in recursive algorithms. It does this by allocating some heap memory to overflow into.
As a word of warning: this takes a very long time to run, but it runs, and it doesn't blow the stack. Compiling with optimizations brings the time down, but it's still pretty slow. You're likely to get better performance from a loop, as Matt suggests. I thought I would throw this out there anyway.
extern crate num_bigint;
extern crate num_traits;
extern crate stacker;

use num_bigint::{BigUint, ToBigUint};
use num_traits::One;

fn factorial(num: u64) -> BigUint {
    // println!("Called with: {}", num);
    let current: BigUint = num.to_biguint().unwrap();
    if num <= 1 {
        // println!("Returning...");
        return One::one();
    }
    stacker::maybe_grow(1024 * 1024, 1024 * 1024, || {
        current * factorial(num - 1)
    })
}

fn main() {
    let num: u64 = 100000;
    println!("Factorial {}! = {}", num, factorial(num));
}
I have commented out the debug printlns; you can uncomment them if you like.

Why doesn't println! work in Rust unit tests?

I've implemented the following method and unit test:
use std::fs::File;
use std::path::Path;
use std::io::prelude::*;

fn read_file(path: &Path) {
    let mut file = File::open(path).unwrap();
    let mut contents = String::new();
    file.read_to_string(&mut contents).unwrap();
    println!("{}", contents);
}
#[test]
fn test_read_file() {
    let path = &Path::new("/etc/hosts");
    println!("{:?}", path);
    read_file(path);
}
I run the unit test this way:
rustc --test app.rs; ./app
I could also run this with
cargo test
I get a message back saying the test passed but the println! is never displayed on screen. Why not?
This happens because Rust test programs hide the stdout of successful tests so that the test output stays tidy. You can disable this behavior by passing the --nocapture option to the test binary or to cargo test (but, in that case, after --; see below):
#[test]
fn test() {
    println!("Hidden output")
}
Invoking tests:
% rustc --test main.rs; ./main
running 1 test
test test ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
% ./main --nocapture
running 1 test
Hidden output
test test ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
% cargo test -- --nocapture
running 1 test
Hidden output
test test ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
If tests fail, however, their stdout will be printed regardless if this option is present or not.
TL;DR
$ cargo test -- --nocapture
With the following code:
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub enum PieceShape {
    King, Queen, Rook, Bishop, Knight, Pawn
}

fn main() {
    println!("Hello, world!");
}

#[test]
fn demo_debug_format() {
    let q = PieceShape::Queen;
    let p = PieceShape::Pawn;
    let k = PieceShape::King;
    println!("q={:?} p={:?} k={:?}", q, p, k);
}
Then run the following:
$ cargo test -- --nocapture
And you should see
Running target/debug/chess-5d475d8baa0176e4
running 1 test
q=Queen p=Pawn k=King
test demo_debug_format ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured
As mentioned by L. F., --show-output is the way to go.
$ cargo test -- --show-output
Other display flags are mentioned in the documentation of cargo test in display-options.
To include print outs with println!() and keep colors for the test results, use the color and nocapture flags in cargo test.
$ cargo test -- --color always --nocapture
(cargo version: 0.13.0 nightly)
While testing, standard output is not displayed. Don't rely on text messages for testing; use assert!, assert_eq!, and fail! (the predecessor of today's panic!) instead. Rust's unit test system can understand these, but not text messages.
The test you have written will pass even if something goes wrong. Let's see why:
read_to_end's signature is
fn read_to_end(&mut self) -> IoResult<Vec<u8>>
It returns an IoResult to indicate success or error. This is just a type alias for a Result whose error value is an IoError. It's up to you to decide how an error should be handled. In this case, we want the task to fail, which is done by calling unwrap on the Result.
This will work:
let contents = File::open(&Path::new("message.txt"))
    .read_to_end()
    .unwrap();
unwrap should not be overused though.
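For readers on current Rust, where IoResult and the Reader methods on File no longer exist, a rough modern equivalent of that snippet (a sketch, not part of the original answer) is:

use std::fs::File;
use std::io::Read;

fn main() {
    let mut contents = Vec::new();
    File::open("message.txt")
        .unwrap()
        .read_to_end(&mut contents)
        .unwrap();
    println!("read {} bytes", contents.len());
}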
Note that the modern solution (cargo test -- --show-output) doesn't apply to doctests defined in a Markdown code fence in the doc comments of your functions. Only println! (etc.) statements in a concrete #[test] function will be respected.
It's likely that the test output is being captured by the testing framework and not being printed to the standard output. When running tests with cargo test, the output of each test is captured and displayed only if the test fails. If you want to see the output of a test, you can use the --nocapture flag when running the test with cargo test. Like so:
cargo test -- --nocapture
Or you can use the println! macro inside a test function to print output to the standard output. Like so:
#[test]
fn test_read_file() {
    let path = &Path::new("/etc/hosts");
    println!("{:?}", path);
    read_file(path);
    println!("The test passed!");
}
Why? I don't know, but there is a small hack: eprintln!("will print in {}", "tests")
In case you want to run the tests and display the printed output every time the file changes:
sudo cargo watch -x "test -- --nocapture"
sudo might be optional depending on your set-up.
