How can I benchmark code that mutates the setup data? - rust

The current implementation of the built-in benchmarking tool appears to run the code inside the iter call multiple times for each time the setup code outside the iter is run. When the code being benchmarked modifies the setup data, subsequent iterations of the benchmarked code are no longer benchmarking the same thing.
As a concrete example, I am benchmarking how fast it takes to remove values from a Vec:
#![feature(test)]
extern crate test;
use test::Bencher;
#[bench]
fn clearing_a_vector(b: &mut Bencher) {
let mut things = vec![1];
b.iter(|| {
assert!(!things.is_empty());
things.clear();
});
}
This will fail:
test clearing_a_vector ... thread 'main' panicked at 'assertion failed: !things.is_empty()', src/lib.rs:11
Performing a similar benchmark of pushing an element onto the vector shows that the iter closure was executed nearly 980 million times (depending on how fast the closure is). The results could be very misleading if there's a single run that does what I expect and millions more that don't.
Tests were run with Rust nightly 1.19.0 (f89d8d184 2017-05-30)

Check out pew, a recently published crate for benchmarking rust code. It allows you to do one time set up that is cloned into every benchmark, or manually run set up by pausing/resuming the benchmark.
This library is in very early phases, but it might be what you're looking for. Contributions are always welcome.

Related

What is the proper way to benchmark things in Rust [duplicate]

Is it possible to benchmark programs in Rust? If yes, how? For example, how would I get execution time of program in seconds?
It might be worth noting two years later (to help any future Rust programmers who stumble on this page) that there are now tools to benchmark Rust code as a part of one's test suite.
(From the guide link below) Using the #[bench] attribute, one can use the standard Rust tooling to benchmark methods in their code.
extern crate test;
use test::Bencher;
#[bench]
fn bench_xor_1000_ints(b: &mut Bencher) {
b.iter(|| {
// Use `test::black_box` to prevent compiler optimizations from disregarding
// Unused values
test::black_box(range(0u, 1000).fold(0, |old, new| old ^ new));
});
}
For the command cargo bench this outputs something like:
running 1 test
test bench_xor_1000_ints ... bench: 375 ns/iter (+/- 148)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured
Links:
The Rust Book (section on benchmark tests)
"The Nightly Book" (section on the test crate)
test::Bencher docs
For measuring time without adding third-party dependencies, you can use std::time::Instant:
fn main() {
use std::time::Instant;
let now = Instant::now();
// Code block to measure.
{
my_function_to_measure();
}
let elapsed = now.elapsed();
println!("Elapsed: {:.2?}", elapsed);
}
There are several ways to benchmark your Rust program. For most real benchmarks, you should use a proper benchmarking framework as they help with a couple of things that are easy to screw up (including statistical analysis). Please also read the "Why writing benchmarks is hard" section at the very bottom!
Quick and easy: Instant and Duration from the standard library
To quickly check how long a piece of code runs, you can use the types in std::time. The module is fairly minimal, but it is fine for simple time measurements. You should use Instant instead of SystemTime as the former is a monotonically increasing clock and the latter is not. Example (Playground):
use std::time::Instant;
let before = Instant::now();
workload();
println!("Elapsed time: {:.2?}", before.elapsed());
The underlying platform-specific implementations of std's Instant are specified in the documentation. In short: currently (and probably forever) you can assume that it uses the best precision that the platform can provide (or something very close to it). From my measurements and experiences, this is typically approximately around 20 ns.
If std::time does not offer enough features for your case, you could take a look at chrono. However, for measuring durations, it's unlikely you need that external crate.
Using a benchmarking framework
Using frameworks is often a good idea, because they try to prevent you from making common mistakes.
Rust's built-in benchmarking framework (nightly only)
Rust has a convenient built-in benchmarking feature, which is unfortunately still unstable as of 2019-07. You have to add the #[bench] attribute to your function and make it accept one &mut test::Bencher argument:
#![feature(test)]
extern crate test;
use test::Bencher;
#[bench]
fn bench_workload(b: &mut Bencher) {
b.iter(|| workload());
}
Executing cargo bench will print:
running 1 test
test bench_workload ... bench: 78,534 ns/iter (+/- 3,606)
test result: ok. 0 passed; 0 failed; 0 ignored; 1 measured; 0 filtered out
Criterion
The crate criterion is a framework that runs on stable, but it is a bit more complicated than the built-in solution. It does more sophisticated statistical analysis, offers a richer API, produces more information and can even automatically generate plots.
See the "Quickstart" section for more information on how to use Criterion.
Why writing benchmarks is hard
There are many pitfalls when writing benchmarks. A single mistake can make your benchmark results meaningless. Here is a list of important but commonly forgotten points:
Compile with optimizations: rustc -O3 or cargo build --release. When you are executing your benchmarks with cargo bench, Cargo will automatically enable optimizations. This step is important as there are often large performance difference between optimized and unoptimized Rust code.
Repeat the workload: only running your workload once is almost always useless. There are many things that can influence your timing: overall system load, the operating system doing stuff, CPU throttling, file system caches, and so on. So repeat your workload as often as possible. For example, Criterion runs every benchmarks for at least 5 seconds (even if the workload only takes a few nanoseconds). All measured times can then be analyzed, with mean and standard deviation being the standard tools.
Make sure your benchmark isn't completely removed: benchmarks are very artificial by nature. Usually, the result of your workload is not inspected as you only want to measure the duration. However, this means that a good optimizer could remove your whole benchmark because it does not have side-effects (well, apart from the passage of time). So to trick the optimizer, you have to somehow use your result value so that your workload cannot be removed. An easy way is to print the result. A better solution is something like black_box. This function basically hides a value from LLVM in that LLVM cannot know what will happen with the value. Nothing happens, but LLVM doesn't know. That is the point.
Good benchmarking frameworks use a block box in several situations. For example, the closure given to the iter method (for both, the built-in and Criterion Bencher) can return a value. That value is automatically passed into a black_box.
Beware of constant values: similarly to the point above, if you specify constant values in a benchmark, the optimizer might generate code specifically for that value. In extreme cases, your whole workload could be constant-folded into a single constant, meaning that your benchmark is useless. Pass all constant values through black_box to avoid LLVM optimizing too aggressively.
Beware of measurement overhead: measuring a duration takes time itself. That is usually only tens of nanoseconds, but can influence your measured times. So for all workloads that are faster than a few tens of nanoseconds, you should not measure each execution time individually. You could execute your workload 100 times and measure how long all 100 executions took. Dividing that by 100 gives you the average single time. The benchmarking frameworks mentioned above also use this trick. Criterion also has a few methods for measuring very short workloads that have side effects (like mutating something).
Many other things: sadly, I cannot list all difficulties here. If you want to write serious benchmarks, please read more online resources.
If you simply want to time a piece of code, you can use the time crate. time meanwhile deprecated, though. A follow-up crate is chrono.
Add time = "*" to your Cargo.toml.
Add
extern crate time;
use time::PreciseTime;
before your main function and
let start = PreciseTime::now();
// whatever you want to do
let end = PreciseTime::now();
println!("{} seconds for whatever you did.", start.to(end));
Complete example
Cargo.toml
[package]
name = "hello_world" # the name of the package
version = "0.0.1" # the current version, obeying semver
authors = [ "you#example.com" ]
[[bin]]
name = "rust"
path = "rust.rs"
[dependencies]
rand = "*" # Or a specific version
time = "*"
rust.rs
extern crate rand;
extern crate time;
use rand::Rng;
use time::PreciseTime;
fn main() {
// Creates an array of 10000000 random integers in the range 0 - 1000000000
//let mut array: [i32; 10000000] = [0; 10000000];
let n = 10000000;
let mut array = Vec::new();
// Fill the array
let mut rng = rand::thread_rng();
for _ in 0..n {
//array[i] = rng.gen::<i32>();
array.push(rng.gen::<i32>());
}
// Sort
let start = PreciseTime::now();
array.sort();
let end = PreciseTime::now();
println!("{} seconds for sorting {} integers.", start.to(end), n);
}
This answer is outdated! The time crate does not offer any advantages over std::time in regards to benchmarking. Please see the answers below for up to date information.
You might try timing individual components within the program using the time crate.
A quick way to find out the execution time of a program, regardless of implementation language, is to run time prog on the command line. For example:
~$ time sleep 4
real 0m4.002s
user 0m0.000s
sys 0m0.000s
The most interesting measurement is usually user, which measures the actual amount of work done by the program, regardless of what's going on in the system (sleep is a pretty boring program to benchmark). real measures the actual time that elapsed, and sys measures the amount of work done by the OS on behalf of the program.
Currently, there is no interface to any of the following Linux functions:
clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts)
getrusage
times (manpage: man 2 times)
The available ways to measure the CPU time and hotspots of a Rust program on Linux are:
/usr/bin/time program
perf stat program
perf record --freq 100000 program; perf report
valgrind --tool=callgrind program; kcachegrind callgrind.out.*
The output of perf report and valgrind depends on the availability of debugging information in the program. It may not work.
I created a small crate for this (measure_time), which logs or prints the time until end of scope.
#[macro_use]
extern crate measure_time;
fn main() {
print_time!("measure function");
do_stuff();
}
The other solution of measuring execution time is creating a custom type, for example, a struct and implement Drop trait for it.
For example:
struct Elapsed(&'static str, std::time::SystemTime);
impl Drop for Elapsed {
fn drop(&mut self) {
println!(
"operation {} finished for {} ms",
self.0,
self.1.elapsed().unwrap_or_default().as_millis()
);
}
}
impl Elapsed {
pub fn start(op: &'static str) -> Elapsed {
let now = std::time::SystemTime::now();
Elapsed(op, now)
}
}
And using it in some function:
fn some_heavy_work() {
let _exec_time = Elapsed::start("some_heavy_work_fn");
// Here's some code.
}
When the function ends, the drop method for _exec_time will be called and the message will be printed.

Does moving data to Rc/Arc always copy it from the stack to the heap?

Take a look at the following simple example:
use std::rc::Rc;
struct MyStruct {
a: i8,
}
fn main() {
let mut my_struct = MyStruct { a: 0 };
my_struct.a = 5;
let my_struct_rc = Rc::new(my_struct);
println!("my_struct_rc.a = {}", my_struct_rc.a);
}
The official documentation of Rc says:
The type Rc<T> provides shared ownership of a value of type T,
allocated in the heap.
Theoretically it is clear. But, firstly my_struct is not immediately wrapped into Rc, and secondly MyStruct is a very simple type. I can see 2 scenarios here.
When my_struct is moved into the Rc the memory content is literally copied from the stack to the heap.
The compiler is able to resolve that my_struct will be moved into the Rc, so it puts it on the heap from the beginning.
If number 1 is true, then there might be a hidden performance bottleneck as when reading the code one does not explicitly see memory being copied (I am assuming MyStruct being much more complex).
If number 2 is true, I wonder whether the compiler is always able to resolve such things. The provided example is very simple, but I can imagine that my_struct is much more complex and is mutated several times by different functions before being moved to the Rc.
Tl;dr It could be either scenario, but for the most part, you should just write code in the most obvious way and let the compiler worry about it.
According to the semantics of the abstract machine, that is, the theoretical model of computation that defines Rust's behavior, there is always a copy. In fact, there are at least two: my_struct is first created in the stack frame of main, but then has to be moved into the stack frame of Rc::new. Then Rc::new has to create an allocation and move my_struct a second time, from its own stack frame into the newly allocated memory*. Each of these moves is conceptually a copy.
However, this analysis isn't particularly useful for predicting the performance of code in practice, for three reasons:
Copies are actually pretty darn cheap. Moving my_struct from one place to another may actually be much cheaper, in the long run, than referencing it with a pointer. Copying a chunk of bytes is easy to optimize on modern processors; following a pointer to some arbitrary location is not. (Bear in mind also that the complexity of the structure is irrelevant because all moves are bytewise copies; for instance, moving any Vec is just copying three usizes regardless of the contents.)
If you haven't measured the performance and shown that excessive copying is a problem, you must not assume that it is without evidence: you may accidentally pessimize instead of optimizing your code. Measure first.
The semantics of the abstract machine is not the semantics of your real machine. The whole point of an optimizing compiler is to figure out the best way to transform one to the other. Under reasonable assumptions, it's very unlikely that the code here would result in 2 copies with optimizations turned on. But how the compiler eliminates one or both copies may be dependent on the rest of the code: not just on the snippet that contains them but on how the data is initialized and so forth. Real machine performance is complicated and generally requires analysis of more than just a few lines at a time. Again, this is the whole point of an optimizing compiler: it can do a much more comprehensive analysis, much faster than you or I can.
Even if the compiler leaves a copy "on the table", you shouldn't assume without evidence that removing the copy would make things better simply because it is a copy. Measure first.
It probably doesn't matter anyway, in this case. Requesting a new allocation from the heap is likely† more expensive than copying a bunch of bytes from one place to another, so fiddling around with 1 fast copy vs. no copies while ignoring a (plausible) big bottleneck is probably a waste of time. Don't try to optimize things before you've profiled your application or library to see where the most performance is being lost. Measure first.
See also
Questions about overflowing the stack by accidentally putting large data on it (to which the solution is usually to use Vec instead of an array):
How to allocate arrays on the heap in Rust 1.0?
Thread '<main>' has overflowed its stack when allocating a large array using Box
* Rc, although part of the standard library, is written in plain Rust code, which is how I analyze it here. Rc could theoretically be subject to guaranteed optimizations that aren't available to ordinary code, but that doesn't happen to be relevant to this case.
† Depending at least on the allocator and on whether new memory must be acquired from the OS or if a recently freed allocation can be re-used.
You can just test what happens:
Try to use my_struct after creating an Rc out of it. The value has been moved, so you can't use it.
use std::rc::Rc;
struct MyStruct {
a: i8,
}
fn main() {
let mut my_struct = MyStruct { a: 0 };
my_struct.a = 5;
let my_struct_rc = Rc::new(my_struct);
println!("my_struct_rc.a = {}", my_struct_rc.a);
// Add this line. Compilation error "borrow of moved value"
println!("my_struct.a = {}", my_struct.a);
}
Make your struct implement the Copy trait, and it will be automatically copied into the Rc::new function. Now the code above works, because the my_struct variable is not moved anywhere, just copied.
#[derive(Clone, Copy)]
struct MyStruct {
a: i8,
}
The compiler is able to resolve that my_struct will be moved into the Rc, so it puts it on the heap from the beginning.
Take a look at Rc::new source code (removed the comment which is irrelevant).
struct RcBox<T: ?Sized> {
strong: Cell<usize>,
weak: Cell<usize>,
value: T,
}
// ...
pub fn new(value: T) -> Rc<T> {
Self::from_inner(Box::into_raw_non_null(box RcBox {
strong: Cell::new(1),
weak: Cell::new(1),
value,
}))
}
It takes the value you pass to it, and creates a Box, so it's always put on the heap. This is plain Rust and I don't think it performs too many sophisticated optimizations, but that may change.
Note that "move" in Rust may also copy data implicitly, and this may depend on the current compiler's behavior. In that case, if you are concerned about performance you can try to make the struct as small as possible, and store some information on the heap. For example when a Vec<T> is moved, as far as I know it only copies the capacity, length and pointer to the heap, but the actual array which is on the heap is not copied element by element, so only a few bytes are copied when moving a vector (assuming the data is copied, because that's also subject to compiler optimizations in case copying is not actually needed).

Is there a consistent compilation context inside a proc_macro_attribute function?

Take the example below:
static COUNT: AtomicUsize = AtomicUsize::new(0);
#[proc_macro_attribute]
pub fn count_usages(_attr: TokenStream, item: TokenStream) -> TokenStream {
let c = COUNT.fetch_add(1, Ordering::AcqRel);
println!("Do stuff with c: {}", c);
item
}
Although the order attributes are processed may differ, will the final count be the same each time for cases such as:
Incremental building
Registry crates and local crates sharing the same proc_macro library and version
Compiler internal parallelism
A practical use case (mine in particular) is generating a compile-time pseudo-static variable memory layout that will be reused in multiple memory managers within a statically linked executable.
While some things might work with the current implementation, procedural macros should not use global variables and expect them to be preserved between calls.
There is currently no official mechanism for storing state between invocations of a procedural macro.
The issue “Crate local state for procedural macros?” mentions some of these points:
Proc macros may not be run on every compilation, for instance if incremental compilation is on and they are in a module that is clean
There is no guarantee of ordering -- if do_it! needs data from all config! invocations, that's a problem.
Properly supporting this feature means adding a new API

Will Rayon avoid spawning threads for a small amount of work?

I was thinking about using Rayon's parallel iterator feature, but I'm concerned about performance for iterating over small collections.
Parallelism overhead sometimes can cause a slowdown on small collections. Iterating over 2 elements is slower if I do the necessary preparations for multi-threading than if I used a single-threaded version. If I have 40 million elements, parallelism will give me a linear performance improvement.
I read about ParallelIterator::weight (0.6.0), but I don't understand if I should optimize such corner cases for small collections or if Rayon is smart and handles everything under the under the hood.
if collection_is_small() {
// Run single threaded version...
} else {
// Use parallel iterator.
}
The ParallelIterator::weight of the processed element is 1. See relevant documentation for good definition, but processing of a single element is cheap.
Google sent me to an old documentation page. Weight was deprecated and removed since version 0.8.0.
Weight API was deprecated in favor of split length control.
By default Rayon will split at every item, effectively making all computation parallel, this behavior can be configured via
with_min_len.
Sets the minimum length of iterators desired to process in each thread. Rayon will not split any smaller than this length, but of course an iterator could already be smaller to begin with.
Producers like zip and interleave will use greater of the two minimums. Chained iterators and iterators inside flat_map may each use their own minimum length.
extern crate rayon; // 1.0.3
use rayon::prelude::*;
use std::thread;
fn main() {
println!("Main thread: {:?}", thread::current().id());
let ids: Vec<_> = (0..4)
.into_par_iter()
.with_min_len(4)
.map(|_| thread::current().id())
.collect();
println!("Iterations: {:?}", ids);
}
Output:
Main thread: ThreadId(0)
Iterations: [ThreadId(0), ThreadId(0), ThreadId(0), ThreadId(0)]
Playground (thanks to #shepmaster for code)
You can empirically see that such a behavior is not guaranteed:
use rayon::prelude::*; // 1.0.3
use std::thread;
fn main() {
let ids: Vec<_> = (0..2)
.into_par_iter()
.map(|_| thread::current().id())
.collect();
println!("{:?}", ids);
}
Various runs of the program show:
[ThreadId(1), ThreadId(2)]
[ThreadId(1), ThreadId(1)]
[ThreadId(2), ThreadId(1)]
[ThreadId(2), ThreadId(2)]
That being said, you should perform your own benchmarking. By default, Rayon creates a global threadpool and uses work stealing to balance the work between the threads. The threadpool is a one-time setup cost per process and work-stealing helps ensure that work only crosses thread boundaries when needed. This is why there are outputs above where both use the same thread.

How do I get detailed flamegraphs with the flame crate for code written using Rayon?

I'm trying to get some performance metrics using the flame crate with code I've written using Rayon:
extern crate flame;
flame::start("TAG-A");
//Assume vec is a Vec<i32>
vec.par_iter_mut().filter(|a| a == 1).for_each(|b| func(b));
//func(b) operates on each i32 and sends some results to a channel
flame::end("TAG-A");
//More code but unrelated
flame::dump_stdout();
This works fine, but only gives information for the entire parallel iterator. I would like to get some more fine grained details on the function func.
I've tried adding a start/end within the function, but the runtime information is only available when I call flame::commit_thread() and then it seems to only print this to stdout. Ideally I'd like to print out the time spent without a given tag when I call dump at the end of my code.
Is there a way to dump tags from all threads? The documentation for flame isn't great.

Resources