There's some background on this topic in the Rust book section on static and dynamic dispatch, but the tl;dr is that calling a method on a trait reference and a few other situations (function pointers, etc.) results in dynamic rather than static dispatch.
What is the actual runtime cost of this, after optimizations have been applied?
For example, imagine this set of structs & traits:
struct Buffer;
struct TmpBuffer;
struct TmpMutBuffer;
impl BufferType for Buffer { ... }
impl BufferType for TmpBuffer { ... }
impl BufferType for TmpMutBuffer { ... }
impl Buffer2D for BufferType { ... }
impl Buffer2DExt for Buffer2D { ... }
Notice that the traits here are implemented on traits themselves.
What is the calling cost of dynamic dispatch to invoke a method from Buffer2DExt on a struct reference?
The recent question What are Rust's exact auto-dereferencing rules? covers the dereferencing rules; are those rules applied at compile time or at runtime?
Disclaimer: the question is fairly open-ended, therefore this answer might be incomplete. Treat it with an even bigger grain of salt than you would normally.
Rust uses a simple "virtual table" to implement dynamic dispatch. This strategy is also used in C++; you can see a study of it here, though the study is a bit dated.
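As a rough mental model (a sketch, not the exact layout the standard library uses), a &dyn Trait reference is a "fat" pointer: one pointer to the data and one to a per-type table of function pointers, and each virtual call first loads the target function's address from that table:
trait Draw {
    fn draw(&self);
}

struct Circle;

impl Draw for Circle {
    fn draw(&self) {
        println!("circle");
    }
}

fn render(items: &[&dyn Draw]) {
    for item in items {
        // Each `item` is a (data pointer, vtable pointer) pair; the call below
        // loads the address of `draw` from the vtable at run time.
        item.draw();
    }
}

fn main() {
    let circle = Circle;
    let items: Vec<&dyn Draw> = vec![&circle];
    render(&items);
}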
The cost of indirection
Virtual dispatch induces indirection, which has a cost for multiple reasons:
indirection is opaque: this inhibits inlining and constant propagation, which are key enablers for many compiler optimizations
indirection has a runtime cost: if incorrectly predicted, you are looking at pipeline stalls and expensive memory fetches
Optimizing indirection
Compilers, however, muddy the waters by trying their best to optimize the indirection away.
devirtualization: sometimes the compiler can resolve the virtual table look-up at compile time (usually, because it knows the concrete type of the object); if so, it can therefore use a regular function call rather than an indirect one, and optimize away the indirection
probabilistic devirtualization: last year Honza Hubička introduced a new optimization in gcc (read the 5-part series as it is very instructive). The gist of the strategy is to build the inheritance graph to make an educated guess at the potential type(s), and then use a pattern like if v.hasType(A) { v.A::call() } elif v.hasType(B) { v.B::call() } else { v.virtual-call() }; special-casing the most likely types means regular calls in this case, and therefore inlined/constant-propagated/full-goodies calls.
This latter strategy could be fairly interesting in Rust due to the coherence rules and privacy rules, as it should have more cases where the full "inheritance" graph is provably known.
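For a minimal illustration of plain devirtualization (a sketch; whether it actually happens depends on the optimizer), consider a call through a trait object whose concrete type is locally provable:
trait Shape {
    fn area(&self) -> f64;
}

struct Square {
    side: f64,
}

impl Shape for Square {
    fn area(&self) -> f64 {
        self.side * self.side
    }
}

fn main() {
    // The static type is `&dyn Shape`, but the concrete type (Square) is known
    // right here, so an optimizing build can resolve the vtable look-up at
    // compile time and then inline `area`.
    let shape: &dyn Shape = &Square { side: 2.0 };
    println!("{}", shape.area());
}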
The cost of monomorphization
In Rust, you can use compile-time polymorphism instead of run-time polymorphism; the compiler will emit one version of the function for each unique combination of compile-time parameters it takes. This, itself, has a cost:
compilation time cost: more code to be produced, more code to be optimized
binary size cost: the produced binary will end-up being larger, a typical size/speed trade-off
run-time cost: possibly, the larger code size might lead to cache misses at CPU level
The compiler might be able to merge together specialized functions that end up having the same implementation (because of phantom types, for example); however, it is still more than likely that the produced binaries (executables and libraries) will end up larger.
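To make the trade-off concrete, here is a small sketch contrasting the two forms of polymorphism: the generic function is monomorphized (one copy per concrete type, statically dispatched), while the dyn version is compiled once and called through the virtual table:
use std::fmt::Display;

// Compile-time polymorphism: the compiler emits one specialized copy per T.
fn print_static<T: Display>(value: T) {
    println!("{}", value);
}

// Run-time polymorphism: a single copy, dispatched through the vtable.
fn print_dynamic(value: &dyn Display) {
    println!("{}", value);
}

fn main() {
    print_static(42);         // instantiates print_static::<i32>
    print_static("hello");    // instantiates print_static::<&str>
    print_dynamic(&42);       // the same machine code serves every type
    print_dynamic(&"hello");
}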
As usual with performance, you have to measure in your case what is more beneficial.
Related
The documentation at https://doc.rust-lang.org/std/convert/trait.From.html states
Note: This trait must not fail. If the conversion can fail, use TryFrom.
Suppose I have a From implementation thus:
impl From<SomeStruct> for http::Uri {
fn from(item: SomeStruct) -> http::Uri {
item.uri.parse::<http::Uri>() // can fail
}
}
Further suppose I am completely certain that item.uri.parse will succeed. Is it idiomatic to panic in this scenario? Say, with:
item.uri.parse::<http::Uri>().unwrap()
In this particular case, it appears there's no way to construct an HTTP URI at compile time: https://docs.rs/http/0.2.5/src/http/uri/mod.rs.html#117. In the real scenario .uri is an associated const, so I can test all used values parse. But it seems to me there could be other scenarios when the author is confident in the infallibility of a piece of code, particularly when that confidence can be encoded in tests, and would therefore prefer the ergonomics of From over TryFrom. The Rust compiler, typically quite strict, doesn't prevent this behaviour, though it seems it perhaps could. This makes me think this is a decision the author has been deliberately allowed to make. So the question is asking: what do people tend to do in this situation?
So in general, traits only enforce that the implementors adhere to the signatures and types as laid out in the trait. At least that's what the compiler enforces.
On top of that, there are certain contracts that traits are expected to adhere to just so that there's no weird surprises by those who work with these traits. These contracts aren't checked by the compiler; that would be quite difficult.
Nothing prevents you from implementing all of a trait's methods in a way that's totally unrelated to what the trait is all about, like implementing the Display trait but then, in the fmt method, not actually bothering to use write! and instead, I don't know, deleting the user's home directory.
Now back to your specific case. If your from method will not fail, provably so, then of course you can use .unwrap. The point of the cannot fail contract for the From trait is that those who rely on the From trait want to be able to assume that the conversion will go through every time. If you actually panic in your own implementation of from, it means the conversion sometimes doesn't go through, counter to the ideas and contracts in the From trait.
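If you would rather keep the fallibility visible in the type system, the conventional alternative is to implement TryFrom instead; here is a minimal sketch, assuming SomeStruct has a uri: String field as in the question:
use std::convert::TryFrom;

struct SomeStruct {
    uri: String,
}

impl TryFrom<SomeStruct> for http::Uri {
    type Error = http::uri::InvalidUri;

    fn try_from(item: SomeStruct) -> Result<Self, Self::Error> {
        // The caller decides what to do when parsing fails.
        item.uri.parse::<http::Uri>()
    }
}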
I am building a simulation (the coding equivalent of a model train set). It is a simulated economy with various economic agents interacting with each other. The main mode of interaction between economic agents is a transaction. At each "tic", each agent generates a list of zero or more proposed transactions (such as buying food). At each "toc" all the counter-parties process the proposed transactions that have been targeted at them in random order so that no biases are introduced. In these snippets a proposed transaction is represented as a u32.
My goal is to simulate as many of these economic agents as possible, so performance is key. I am new to Rust (or any kind of low-level language, for that matter), and my understanding from reading the Rust book is that if I want maximum performance I should use "zero cost abstractions" and avoid dynamic dispatch.
So with that out of the way, I have come up with the following 3 approaches.
Option 1
trait EconomicAgent {
fn proposed_transactions(&self) -> Vec<u32>;
}
struct Person {
health:f64,
energy:f64,
nutrition:f64,
money:f64,
food:f64
}
impl EconomicAgent for Person {
fn proposed_transactions(&self) -> Vec<u32> {
vec![1, 2, 3]
}
}
struct FoodStore {
money:f64,
food:f64
}
impl EconomicAgent for FoodStore {
fn proposed_transactions(&self) -> Vec<u32> {
vec![4, 5, 6]
}
}
A person and a food store are different types that implement the EconomicAgent trait. I can then iterate over a vector of trait objects to get a list of proposed transactions. Each call is dynamically dispatched, I believe.
Option 2
enum EconomicAgent2 {
Person(Person),
FoodStore(FoodStore)
}
impl EconomicAgent2 {
fn proposed_transactions(&self) -> Vec<u32> {
match self{
EconomicAgent2::Person(person) => person.proposed_transactions(),
EconomicAgent2::FoodStore(food_store) => food_store.proposed_transactions()
}
}
}
Here, an EconomicAgent is not a trait but rather an enum, and, well, you can see how it works.
Option 3
const HEALTH_INDEX : u8 = 0;
const ENERGY_INDEX : u8 = 1;
const NUTRITION_INDEX : u8 = 2;
const MONEY_INDEX : u8 = 3;
const FOOD_INDEX : u8 = 4;
enum EconomicAgentTag {
Person,
FoodStore
}
struct EconomicAgent3 {
tag: EconomicAgentTag,
resources:[f64; 5],
proposed_transactions: Box<fn(&EconomicAgent3) -> Vec<u32>>
}
fn new_person() -> EconomicAgent3 {
EconomicAgent3 {
tag: EconomicAgentTag::Person,
resources: [0.0,0.0,0.0,0.0,0.0],
proposed_transactions: Box::new(|_| vec![1, 2, 3])
}
}
fn new_food_store() -> EconomicAgent3 {
EconomicAgent3 {
tag: EconomicAgentTag::FoodStore,
resources: [0.0,0.0,0.0,0.0,0.0],
proposed_transactions: Box::new(|_| vec![4, 5, 6])
}
}
Here an economic agent is a more abstract representation.
Now imagine that there are many different types of economic agents (banks, mines, farms, clothing stores, etc.). They all interact by proposing and accepting transactions. Option 1 seems to suffer from dynamic dispatch. Option 2 seems to be my own version of dynamic dispatch via a match expression, so it is probably no better, right? Option 3 seems like it should be the most performant, but it does not really allow much cognitive ease on the part of the programmer.
So finally the questions:
Clearly dynamic dispatch is involved in option 1. What about options 2 and 3?
Which is expected to be most performant? Note I am not really in a position to do testing as the full idea (only on paper right now) is obviously more complex than these snippets and the choice now will affect the entire structure for the whole project.
What would be an idiomatic choice here?
All your options use dynamic dispatch or branches in one way or another to call the right function for each element. The reason is that you are mixing all the agents into a single place, which is where the different performance penalties come from (not just the indirect calls or branches, but also cache misses etc.).
Instead, for a problem like this, you want to separate the different "agents" into separate, independent "entities". Then, to reuse code, you will want to factor out "components" for which subsets of them are iterated by "systems".
This is what is usually called an "Entity-Component-System" (ECS) of which there are many models and implementations. They are typically used by games and other simulations.
If you search for ECS you will find many questions, articles and so on about it and the different approaches.
What is Dynamic Dispatch?
Dynamic Dispatch is usually reserved for indirect function calls, i.e. function calls which occur via a function pointer.
In your case, both Option 1 and Option 3 are cases of dynamic dispatch:
Traits use a virtual table, which is a table of function pointers.
fn(...) -> ... is a function pointer.
What is the performance penalty of Dynamic Dispatch?
At run-time, there is little to no difference between a regular function call and a so-called virtual call:
Indirect function calls can be predicted; there's a special predictor for them in your CPU.
The cost of the function call is mostly saving/restoring registers, which happens in both cases.
The performance penalty is more insidious, it happens at compile-time.
The mother of optimizations is inlining, which essentially copy/pastes the body of the function being called right at the call-site. Once a function is inlined, many other optimization passes can go to town on the (combined) code. This is especially lucrative on very small functions (a getter), but can also be quite beneficial on larger functions.
An indirect function call, however, is opaque. There are many candidate functions, and thus the optimizer cannot perform inlining... nipping many potential optimizations in the bud. Devirtualization is sometimes available -- the compiler deducing which function(s) can be called -- but should not be relied on.
Which Option to choose?
Among those presented: Option 2!
The main advantage of Option 2 is that there is no indirect function calls. In both branches of your match, the compiler has a known type for the receiver of the method and can therefore inline the method if suitable, enabling all the optimizations.
Is there better?
With an open design, an Array of Structs is a better way to structure the system, mostly avoiding branch misprediction:
struct EconomicAgents {
    persons: Vec<Person>,
    food_stores: Vec<FoodStore>,
}
This is the core design of the ECS solution proposed by @Acorn.
Note: as noted by @Acorn in the comments, an Array of Structs is also close to optimal cache-wise -- no indirection, and very little padding between elements.
Going with a full ECS is a trickier proposition. Unless you have dynamic entities -- Persons/FoodStores are added/removed during the runs -- I would not bother. ECS are helpful for dynamism, but they have to choose a trade-off between various characteristics: do you want faster add/remove, or faster iteration? Unless you need all their features, they will likely add their own overhead due to trade-offs that do not match your needs.
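As a rough sketch of how such a structure might be used (assuming the Person and FoodStore types and the EconomicAgent trait from Option 1), each vector is iterated with plain, statically dispatched calls:
fn collect_proposals(agents: &EconomicAgents) -> Vec<u32> {
    let mut proposals = Vec::new();
    for person in &agents.persons {
        // Concrete type known here: statically dispatched and inlinable.
        proposals.extend(person.proposed_transactions());
    }
    for store in &agents.food_stores {
        proposals.extend(store.proposed_transactions());
    }
    proposals
}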
How to avoid dynamic dispatch?
You could go with option 1 and, instead of having a vector of trait objects, keep each type in its own vector and iterate them individually. It is not a nice solution, so...
Instead...
Choose whichever option allows you to model your simulation best and don't worry about the cost of dynamic dispatch. The overhead is small. There are other things that will impact the performance more, such as allocating a new Vec for every call.
The main cost of dynamic dispatch is the indirect branch predictor making a wrong guess. To help your CPU make better guesses, you may try to keep objects of the same type next to each other in the vector, for example by sorting it by type.
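A minimal sketch of that suggestion, assuming the EconomicAgent2 enum from Option 2; grouping agents by variant keeps the branch in the match predictable:
fn variant_tag(agent: &EconomicAgent2) -> u8 {
    match agent {
        EconomicAgent2::Person(_) => 0,
        EconomicAgent2::FoodStore(_) => 1,
    }
}

fn group_by_type(agents: &mut Vec<EconomicAgent2>) {
    // After sorting, all Persons are contiguous, then all FoodStores, so the
    // branch inside proposed_transactions() is predicted correctly most of the time.
    agents.sort_by_key(variant_tag);
}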
Which one is idiomatic?
Option 1 has the issue that to store objects of different types in a vector, you must go through indirection. The easiest way is to Box every object, but that means every access not only involves dynamic dispatch for the function call, but also has to follow an extra pointer to get to the data.
Option 2 with an enum is (in my opinion) more idiomatic - you have all your data together in contiguous memory. But beware that the size of an enum is at least the size of its largest variant. So if your agents vary in size, it may be better to go with option 1.
I ran across a comment on reddit that indicates that using Cell<T> prevents certain optimizations from occurring:
Cell works with no memory overhead (Cell is the same size as T) and little runtime overhead (it "just" inhibits optimisations, it doesn't introduce extra explicit operations)
This seems counter to other things I've read about Cell<T>, in particular that it's "zero-cost." The first place I encountered this categorization is here.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.
TL;DR: Cell is a Zero-Overhead Abstraction; that is, the same functionality implemented manually has the same cost.
The term Zero-Cost Abstractions is not English, it's jargon. The idea of Zero-Cost Abstractions is that the layer of abstraction itself does not add any cost compared to manually doing the same thing.
There are various misunderstandings that have sprung up: most notably, I have regularly seen zero-cost understood as "the operation is free", which is not the case.
To add to the confusion, the exception mechanism used by most C++ implementations, and which Rust uses for panic = unwind is called Zero-Cost Exceptions, and purports1 to add no overhead on the non-throwing path. It's a different kind of Zero-Cost...
Lately, my recommendation is to switch to using the term Zero-Overhead Abstractions: first because it's a distinct term from Zero-Cost Exceptions, so less likely to be mistaken, and second because it emphasizes that the Abstraction does not add Overhead, which is what we are trying to convey in the first place.
1 The objective is only partially achieved. While the same assembly executed with and without the possibility of throwing indeed has the same performance, the presence of potential exceptions may hinder the optimizer and cause it to generate sub-optimal assembly in the first place.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.
On the memory side, there is no overhead:
size_of::<Cell<T>>() == size_of::<T>(),
given a cell of type Cell<T>, &cell == cell.as_ptr().
(You can peek at the source code)
On the access side, Cell<T> does incur a run-time cost compared to T; the cost of the extra functionality.
The most immediate cost is that manipulating the value through a &Cell<T> requires copying it back and forth1. This is a bitwise copy, so the optimizer may elide it, if it can prove that it is safe to do so.
Another notable cost is that UnsafeCell<T>, on which Cell<T> is based, breaks the rule that &T means T cannot be modified.
When a compiler can prove that a portion of memory cannot be modified, it can optimize out further reads: read t.foo in a register, then use the register value rather than reading t.foo again.
In traditional Rust code, a &T gives such a guarantee: no matter if there are opaque function calls, calls to C code, etc... between two reads of t.foo, the second read will return the same value as the first, guaranteed. With a &Cell<T>, there is no such guarantee any longer, and thus, unless the optimizer can prove beyond doubt that the value is unmodified2, it cannot apply such optimizations.
1 You can manipulate the value at no cost through &mut Cell<T> or using unsafe code.
2 For example, if the optimizer knows that the value resides on the stack, and it never passed the address of the value to anyone else, then it can reasonably conclude that no one else can modify the value. Although a stack-smashing attack may, of course.
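To illustrate the read-caching point, here is a small sketch (whether the second load is actually elided depends on the optimizer):
use std::cell::Cell;

fn read_twice_plain(value: &u64, opaque: impl Fn()) -> u64 {
    let a = *value;
    opaque();       // &u64 guarantees the pointee cannot change under us,
    let b = *value; // so the compiler may reuse the value already in a register.
    a + b
}

fn read_twice_cell(value: &Cell<u64>, opaque: impl Fn()) -> u64 {
    let a = value.get(); // bitwise copy out of the cell
    opaque();            // the closure may hold another &Cell to the same cell and mutate it,
    let b = value.get(); // so this second read cannot be elided.
    a + b
}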
Rust has an "inline" attribute that can be used in one of those three flavors:
#[inline]
#[inline(always)]
#[inline(never)]
When should they be used?
In the Rust reference, we see an inline attributes section saying
The compiler automatically inlines functions based on internal heuristics. Incorrectly inlining functions can actually make the program slower, so it should be used with care.
In the Rust internals forum, huon was also conservative about specifying inline.
But we see considerable usage in the Rust source, including the standard library. A lot of inline attributes are added to one-line functions, which should be easy for the compiler to spot and optimize through heuristics, according to the reference. Are those in fact not needed?
One limitation of the current Rust compiler is that, if you're not using LTO (Link-Time Optimization), it will never inline a function not marked #[inline] across crates. Rust uses a separate compilation model similar to C++ because LLVM's LTO implementation doesn't scale well to large projects. Therefore, small functions exposed to other crates need to be marked by hand. This isn't a great situation, and it's likely to be fixed in the future by some combination of improvements to LTO and MIR inlining.
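For example (a hypothetical accessor, not taken from any particular crate), a tiny public function in a library crate is typically annotated so that downstream crates can inline it even without LTO:
pub struct Meters(pub f64);

impl Meters {
    // Without #[inline] (and without LTO), callers in other crates would pay
    // a full function call for this one-line accessor.
    #[inline]
    pub fn value(&self) -> f64 {
        self.0
    }
}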
#[inline(never)] is sometimes useful for debugging (separating a piece of code which isn't working as expected). In theory, it can be used for benchmarking, but that's usually a bad idea: turning off inlining doesn't prevent other inter-procedural optimizations like constant propagation. In terms of normal code, it can reduce code size if you have a frequently used helper function which is only used for error handling.
#[inline(always)] is generally a bad idea; if a function is big enough that the compiler won't inline it by default, it's big enough that the overhead of the call doesn't matter (and excessive inlining increases instruction cache pressure). There are exceptions, but you need performance measurements to justify it. This example is the sort of situation where it's worth considering. #[inline(always)] can also be used to improve -O0 code quality, but that's not usually worth worrying about.
What follows is just used as an example, and not valid Rust code.
struct Vec<T: Sized, Count> {
a: [T; Count]
}
Something like it is possible in C++ templates, but I haven't seen it in Rust.
Rust 1.51
Use const generics:
struct Vec<T: Sized, const COUNT: usize> {
a: [T; COUNT],
}
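A quick usage sketch, assuming the definition above; the length is part of the type and is checked at compile time:
fn main() {
    let v = Vec::<u8, 3> { a: [1, 2, 3] };
    // Vec<u8, 3> and Vec<u8, 4> are distinct types; an array literal of the
    // wrong length here would be a compile-time error.
    println!("{}", v.a.len());
}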
Previous versions
RFC 2000 — const generics introduces support for this and progress is tracked in issue #44580.
If you look at the design of Rust, you will notice that it started by tackling the hardest problems first (memory safety, data-race freedom), but there are otherwise lots of areas where it is "incomplete" (compared to what could be achieved).
In particular, generic structures and functions started out somewhat limited:
lack of Higher Kinded Types (HKT)
lack of non-type parameters => arrays are special-cased, and implementing a trait for an array is a known issue, the work-around being to implement it for a few different dimensions
lack of variadic parameters => tuples are special-cased, and implementing a trait for all tuples is similarly difficult
For the moment, not all of these are implemented, not because they are not desired but simply because time was lacking. The idea of Rust 1.0 was not to release a final product that would not evolve, but a stable base from which to start; some or maybe all will come.
While waiting for Rust to gain first-class support for this, there are crates that provide certain levels of this functionality, such as:
typenum
generic-array