When should inline be used in Rust? - rust

Rust has an "inline" attribute that can be used in one of those three flavors:
#[inline]
#[inline(always)]
#[inline(never)]
When should they be used?
In the Rust reference, we see an inline attributes section saying
The compiler automatically inlines functions based on internal heuristics. Incorrectly inlining functions can actually make the program slower, so it should be used with care.
In the Rust internals forum, huon was also conservative about specifying inline.
But we see considerable usage in the Rust source, including the standard library. A lot of inline attributes are added to one-line-functions, which should be easy for the compilers to spot and optimize through heuristics according to the reference. Are those in fact not needed?

One limitation of the current Rust compiler is that, if you're not using LTO (Link-Time Optimization), it will never inline a function not marked #[inline] across crates. Rust uses a separate compilation model similar to C++ because LLVM's LTO implementation doesn't scale well to large projects. Therefore, small functions exposed to other crates need to be marked by hand. This isn't a great situation, and it's likely to be fixed in the future by some combination of improvements to LTO and MIR inlining.
#[inline(never)] is sometimes useful for debugging (separating a piece of code which isn't working as expected). In theory, it can be used for benchmarking, but that's usually a bad idea: turning off inlining doesn't prevent other inter-procedural optimizations like constant propagation. In terms of normal code, it can reduce codesize if you have a frequently used helper function which is only used for error handling.
#[inline(always)] is generally a bad idea; if a function is big enough that the compiler won't inline it by default, it's big enough that the overhead of the call doesn't matter (and excessive inlining increases instruction cache pressure). There are exceptions, but you need performance measurements to justify it. This example is the sort of situation where it's worth considering. #[inline(always)] can also be used to improve -O0 code quality, but that's not usually worth worrying about.
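As a rough illustration of the two cases above, here is a minimal sketch. The names Meters, value, report_out_of_range and checked_get are made up for this example and do not come from any real crate; the sketch only shows where the attributes would go, not a measured optimization.

```rust
/// Hypothetical newtype exposed by a library crate.
pub struct Meters(pub f64);

impl Meters {
    /// A tiny accessor that other crates will call. Marking it #[inline]
    /// makes its body available for inlining in downstream crates even
    /// without LTO.
    #[inline]
    pub fn value(&self) -> f64 {
        self.0
    }
}

/// Error-path helper kept out of line so that every caller only pays for
/// a call instruction instead of a copy of the panic machinery.
#[inline(never)]
#[cold]
fn report_out_of_range(index: usize, len: usize) -> ! {
    panic!("index {} out of range for length {}", index, len);
}

/// Hypothetical hot-path function that uses the cold helper.
pub fn checked_get(data: &[u32], index: usize) -> u32 {
    if index >= data.len() {
        report_out_of_range(index, data.len());
    }
    data[index]
}
```

Calling checked_get(&[1, 2, 3], 1) returns the element at index 1, and the out-of-range branch stays a plain call to the helper. Whether this actually shrinks the binary or speeds anything up still has to be measured.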

Related

Why are ceilf32 and sqrtf32 unsafe?

I'm pretty new to Rust and have been working on some mathematical problems. For one of these problems I needed ceilf32 and sqrtf32. I was surprised to find that these functions are unsafe; both are fairly simple mathematical functions and my understanding is that unsafe Rust is used only as necessary to work around either the conservatism of the compiler or to allow inherently unsafe OS operations. I can't see any reason either function would run into either issue, thus I can't understand what would stop them being implemented with memory safety.
Could someone please enlighten me?
The functions you're looking at are in core::intrinsics, which are low-level compiler instructions. I don't see any official documentation on why they're marked unsafe, but my guess is that all of the compiler intrinsics were marked that way as a rule, since they're lower-level than most of Rust proper.
Regardless, for normal operation, you're looking for the inherent methods f32::ceil and f32::sqrt. These are the Rust standard library implementations that presumably[1] call the intrinsics as a course of action, and these methods are not marked unsafe.
Since they're inherent methods, you can either call them on f32 objects (my_number.sqrt()) or directly with the namespace (f32::sqrt(my_number)).
[1] In fact, a look at the source code for the current implementations indicates that both of these simply delegate to their intrinsic counterpart, wrapping it in an unsafe block to guarantee safety.
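For completeness, a small sketch of the safe inherent methods in both call styles. The function name hypotenuse_ceil is invented for this example.

```rust
/// Hypothetical helper: length of the hypotenuse, rounded up.
/// Uses only the safe inherent methods, so no unsafe block is needed.
pub fn hypotenuse_ceil(a: f32, b: f32) -> f32 {
    // Method-call syntax on an f32 value.
    let h = (a * a + b * b).sqrt();
    // Fully qualified syntax, equivalent to h.ceil().
    f32::ceil(h)
}
```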

Rust features which allow the optimizer to change the program's result?

In some languages, optimization is allowed to change the program execution result. For example,
C++11 has the concept of "copy-elision" which allows the optimizer to ignore the copy constructor (and its side-effects) in some circumstances.
Swift has the concept of "imprecise lifetimes" which allows the optimizer to release objects at any time after last usage before the end of lexical scope.
In both cases, optimizations are not guaranteed to happen, therefore the program execution result can be significantly different based on the optimizer implementations (e.g. debug vs. release build)
Copying can be skipped, and an object can die while a reference is alive. The only way to deal with these behaviors is by being defensive and making your program work correctly regardless of whether the optimizations happen or not. If you don't know about the existence of this behavior, it's impossible to write correct programs with the tools.
This is different from "random operations" which are written by the programmer to produce random results intentionally. These behaviors are (1) done by optimizer and (2) can randomize execution result regardless of programmer intention. This is done by the language designer's intention for better performance. A sort of trade-off between performance and predictability.
Does Rust have (or consider) any of this kind of behavior? Any optimization that is allowed to change program execution result for better performance. If it has any, what is the behavior and why is it allowed?
I know the term "execution result" could be vague, but I don't know a proper term for this. I'm sorry for that.
I'd like to collect every potential case here, so everyone can be aware of them and be prepared for them. Please post any case as an answer (or comment) if you think your case produces different results.
I think all arguable cases are worth mentioning, because someone can be helped a lot by reading the case details.
If you restrict yourself to safe Rust code, the optimizer shouldn't change the program result. Of course there are some optimizations that can be observable due to their very nature. For example removing unused variables can mean your code overflows the stack without optimizations, while everything will fit on the stack when compiled with optimizations. Or your code may just be too slow to ever finish when compiled without optimizations, which is also an observable difference. And with unsafe code triggering undefined behaviour anything can happen, including the optimizer changing the outcome of your code.
There are, however, a few cases where program execution can change depending on whether you are compiling in debug mode or in release mode:
Integer overflow will result in a panic in a debug build, while integers wrap around according to the two's complement representation in release mode; see RFC 560 for details. This behaviour can be controlled with the -C overflow-checks codegen option, so you can disable overflow checks in debug mode or enable them in release mode if you want to.
The debug_assert!() macro defines assertions that are only executed in debug mode. There's again a manual override using the -C debug-assertions codegen option.
Your code can check whether debug assertions are enabled using the debug_assertions configuration option, for example with cfg!(debug_assertions) or #[cfg(debug_assertions)].
These are all related to debug assertions in some way, but this list is not exhaustive. You can probably also inspect the environment to determine whether the code is compiled in debug or release mode, and change the behaviour based on this.
None of these examples really fall into the same category as your examples in the original question. Safe Rust code should generally behave the same regardless of whether you compile in debug mode or release mode.
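If you want overflow behaviour that does not depend on the build profile at all, the explicit integer methods are one option. A minimal sketch (the function names are invented for this example):

```rust
/// Always wraps on overflow, in debug and release builds alike.
pub fn add_wrapping(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Always reports overflow as None, in debug and release builds alike.
pub fn add_checked(a: u8, b: u8) -> Option<u8> {
    a.checked_add(b)
}

/// Reports which way the debug_assertions configuration option is set
/// for the current build.
pub fn build_kind() -> &'static str {
    if cfg!(debug_assertions) {
        "debug assertions on"
    } else {
        "debug assertions off"
    }
}
```

With these, add_wrapping(250, 10) gives 4 and add_checked(250, 10) gives None no matter how the crate is compiled, whereas a plain 250u8 + 10 with runtime values would panic or wrap depending on the overflow-checks setting.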
There are far fewer foot-guns in Rust when compared to C++. In general, they revolve around unsafe, raw pointers and lifetimes derived from them or any form of undefined behavior, which is really undefined in Rust as well. However, if your code compiles (and, if in doubt, passes cargo miri test), you most likely won't see surprising behavior.
Two examples that come to mind which can be surprising:
The lifetime of a MutexGuard; the example comes from the book:
while let Ok(job) = receiver.lock().unwrap().recv() {
    job();
}
One might think/hope that the Mutex on the receiver is released once a job has been acquired, so that job() executes while other threads can receive jobs. However, temporaries created in the condition expression of a while let live until the end of the loop body, so the MutexGuard returned by lock() is held for the entirety of the while block, including the call to job(). This means only one thread can execute a job at any given time.
If you do
loop {
    let job = receiver.lock().unwrap().recv().unwrap();
    job();
}
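this will allow multiple threads to run in parallel, because the temporary MutexGuard is dropped at the end of the let statement, before job() is called. It's not obvious why this is.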
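A self-contained sketch of the second form, for anyone who wants to try it. The function run_jobs, the Job alias and the idea of summing job results are assumptions made for this example and are not taken from the book.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Hypothetical job type: a boxed closure that returns a number.
type Job = Box<dyn FnOnce() -> u32 + Send + 'static>;

/// Spawns worker_count threads, sends job_count jobs (job i returns i),
/// and returns the sum of all job results.
pub fn run_jobs(worker_count: usize, job_count: u32) -> u32 {
    let (sender, receiver) = mpsc::channel::<Job>();
    let receiver = Arc::new(Mutex::new(receiver));

    let mut handles = Vec::new();
    for _ in 0..worker_count {
        let receiver = Arc::clone(&receiver);
        handles.push(thread::spawn(move || {
            let mut sum = 0;
            loop {
                // The MutexGuard is a temporary of this let statement,
                // so the lock is released before the job runs.
                let message = receiver.lock().unwrap().recv();
                match message {
                    Ok(job) => sum += job(),
                    // The sender was dropped: no more jobs will arrive.
                    Err(_) => break,
                }
            }
            sum
        }));
    }

    for i in 1..=job_count {
        sender.send(Box::new(move || i)).unwrap();
    }
    // Closing the channel lets the workers leave their loops.
    drop(sender);

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

The lock is still held while a worker waits inside recv(), but it is released before job() runs, so a long-running job no longer blocks the other workers from picking up work.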
There have been multiple questions regarding const. The compiler gives no guarantee about whether a const actually exists only once (as an optimization) or is instantiated wherever it is used. The second case is the way one should think about const: every use behaves as if the value were written out again at that point. So this can happen:
const EXAMPLE: Option<i32> = Some(42);
fn main() {
    assert_eq!(EXAMPLE.take(), Some(42));
    assert_eq!(EXAMPLE, Some(42)); // Where did this come from?
}
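To make the contrast explicit, here is a small sketch comparing a const with a local variable initialised from it. The function names are invented for this example, and the allow attribute only silences the lint that current compilers emit for mutating a temporary created from a const.

```rust
const EXAMPLE: Option<i32> = Some(42);

/// Each mention of EXAMPLE creates a fresh temporary, so take() never
/// empties "the" constant.
#[allow(const_item_mutation)]
pub fn take_from_const_twice() -> (Option<i32>, Option<i32>) {
    (EXAMPLE.take(), EXAMPLE.take())
}

/// A local variable is a single place in memory, so the second take()
/// sees the effect of the first.
pub fn take_from_local_twice() -> (Option<i32>, Option<i32>) {
    let mut local = EXAMPLE;
    (local.take(), local.take())
}
```

The first function returns (Some(42), Some(42)), the second (Some(42), None).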

Can I avoid using explicit lifetime specifiers and instead use reference counting (Rc)?

I am reading the Rust Book and everything was pretty simple to understand (thanks to the book's authors), until the section about lifetimes. I spent all day, reading a lot of articles on lifetimes and still I am very insecure about using them correctly.
What I do understand, though, is that the concept of explicit lifetime specifiers aims to solve the problem of dangling references. I also know that Rust has reference-counting smart pointers (Rc) which I believe is the same as shared_ptr in C++, which has the same purpose: to prevent dangling references.
Given that those lifetimes are so horrendous to me, and smart pointers are very familiar and comfortable for me (I used them in C++ a lot), can I avoid the lifetimes in favor of smart pointers? Or are lifetimes an inevitable thing that I'll have to understand and use in Rust code?
are lifetimes an inevitable thing that I'll have to understand and use in Rust code?
In order to read existing Rust code, you probably don't need to understand lifetimes. The borrow-checker understands them so if it compiles then they are correct and you can just review what the code does.
I am very insecure about using them correctly.
The most important thing to understand about lifetime annotations is that they do nothing. Rather, they are a way to express to the compiler the relationship between references. For example, if an input and output to a function have the same lifetime, that means that the output contains a reference to the input (or part of it) and therefore is not allowed to live longer than the input. Using them "incorrectly" means that you are telling the compiler something about the lifetime of a reference which it can prove to be untrue - and it will give you an error, so there is nothing to be insecure about!
can I avoid the lifetimes in favor of smart pointers?
You could choose to avoid using references altogether and use Rc everywhere. You would be missing out on one of the big features of Rust: lifetimes and references form one of the most important zero-cost abstractions, which enable Rust to be fast and safe at the same time. There is code written in Rust that nobody would attempt to write in C/C++ because a human could never be absolutely certain that they haven't introduced a memory bug. Avoiding Rust references in favour of smart pointers will mostly result in slower code, because smart pointers have runtime overhead.
Many APIs use references. In order to use those APIs you will need to have at least some grasp of what is going on.
The best way to understand is just to write code and gain an intuition from what works and what doesn't. Rust's error messages are excellent and will help a lot with forming that intuition.
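To make the trade-off concrete, here is a sketch of the same function written once with references and a lifetime annotation and once with Rc. Both function names are made up for this example.

```rust
use std::rc::Rc;

/// The annotation says: the result borrows from a or b, so it must not
/// outlive either of them. Nothing is allocated or counted at run time.
pub fn longest<'a>(a: &'a str, b: &'a str) -> &'a str {
    if a.len() >= b.len() { a } else { b }
}

/// The Rc version needs no lifetime annotation on the result, but it
/// bumps a reference count at run time, and the strings have to live
/// in Rc allocations to begin with.
pub fn longest_rc(a: &Rc<str>, b: &Rc<str>) -> Rc<str> {
    if a.len() >= b.len() { Rc::clone(a) } else { Rc::clone(b) }
}
```

Note that even the Rc version takes its arguments by reference, which is the usual way to avoid unnecessary count updates; references are hard to avoid completely.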

What costs are incurred when using Cell<T> as opposed to just T?

I ran across a comment on reddit that indicates that using Cell<T> prevents certain optimizations from occurring:
Cell works with no memory overhead (Cell is the same size as T) and little runtime overhead (it "just" inhibits optimisations, it doesn't introduce extra explicit operations)
This seems counter to other things I've read about Cell<T>, in particular that it's "zero-cost." The first place I encountered this categorization is here.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.
TL;DR Cell is a Zero-Overhead Abstraction; that is, the same functionality implemented manually has the same cost.
The term Zero-Cost Abstractions is not English, it's jargon. The idea of Zero-Cost Abstractions is that the layer of abstraction itself does not add any cost compared to manually doing the same thing.
There are various misunderstandings that have sprung up: most notably, I have regularly seen zero-cost understood as "the operation is free", which is not the case.
To add to the confusion, the exception mechanism used by most C++ implementations, and which Rust uses for panic = unwind, is called Zero-Cost Exceptions, and purports[1] to add no overhead on the non-throwing path. It's a different kind of Zero-Cost...
Lately, my recommendation is to switch to using the term Zero-Overhead Abstractions: first because it's a distinct term from Zero-Cost Exceptions, so less likely to be mistaken, and second because it emphasizes that the Abstraction does not add Overhead, which is what we are trying to convey in the first place.
[1] The objective is only partially achieved. While the same assembly executed with and without the possibility of throwing indeed has the same performance, the presence of potential exceptions may hinder the optimizer and cause it to generate sub-optimal assembly in the first place.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.
On the memory side, there is no overhead:
size_of::<Cell<T>>() == size_of::<T>(),
given a cell of type Cell<T>, &cell == cell.as_ptr().
(You can peek at the source code)
On the access side, Cell<T> does incur a run-time cost compared to T; the cost of the extra functionality.
The most immediate cost is that manipulating the value through a &Cell<T> requires copying it back and forth[1]. This is a bitwise copy, so the optimizer may elide it, if it can prove that it is safe to do so.
Another notable cost is that UnsafeCell<T>, on which Cell<T> is based, breaks the rule that &T means that T cannot be modified.
When a compiler can prove that a portion of memory cannot be modified, it can optimize out further reads: read t.foo into a register, then use the register value rather than reading t.foo again.
In traditional Rust code, a &T gives such a guarantee: no matter if there are opaque function calls, calls to C code, etc... between two reads of t.foo, the second read will return the same value as the first, guaranteed. With a &Cell<T>, there is no such guarantee any longer, and thus, unless the optimizer can prove beyond doubt that the value is unmodified[2], it cannot apply such optimizations.
[1] You can manipulate the value at no cost through &mut Cell<T> or using unsafe code.
[2] For example, if the optimizer knows that the value resides on the stack, and it never passed the address of the value to anyone else, then it can reasonably conclude that no one else can modify the value. Although a stack-smashing attack may, of course.
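A small sketch of both points, the size claim and the "value may change behind a shared reference" claim. The function names are invented for this example, and the sketch shows observable behaviour only, not what the optimizer actually emits.

```rust
use std::cell::Cell;
use std::mem::size_of;

/// Cell<T> adds no memory overhead (checked here for u64).
pub fn cell_has_same_size() -> bool {
    size_of::<Cell<u64>>() == size_of::<u64>()
}

/// Mutation through a shared reference: copy the value out, copy a new
/// value back in. Returns the old value.
pub fn bump(counter: &Cell<u32>) -> u32 {
    let old = counter.get();
    counter.set(old + 1);
    old
}

/// Two reads separated by an opaque call. With &u32 the compiler could
/// reuse the first read; with &Cell<u32> the callback may have changed
/// the value, so the second read has to happen.
pub fn read_around_call(value: &Cell<u32>, callback: impl Fn()) -> (u32, u32) {
    let before = value.get();
    callback();
    let after = value.get();
    (before, after)
}
```

For instance, with let c = Cell::new(1), the call read_around_call(&c, || c.set(5)) returns (1, 5): the two reads disagree even though only a shared reference was passed in.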

Multiple specialization, iterator patterns in Rust

Learning Rust (yay!) and I'm trying to understand the intended idiomatic programming required for certain iterator patterns, while scoring top performance. Note: not Rust's Iterator trait, just a method I've written accepting a closure and applying it to some data I'm pulling off of disk / out of memory.
I was delighted to see that Rust (+LLVM?) took an iterator I had written for sparse matrix entries, and a closure for doing sparse matrix vector multiplication, written as
iterator.map_edges({ |x, y| dst[y] += src[x] });
and inlined the closure's body in the generated code. It went quite fast. :D
If I create two of these iterators, or use the first a second time (not a correctness issue) each instance slows down quite a lot (about 2x in this case), presumably because the optimizer no longer chooses to do specialization because of the multiple call sites, and you end up doing a function call for each element.
I'm trying to understand if there are idiomatic patterns that keep the pleasant experience above (I like it, at least) without sacrificing the performance. My options seem to be (none satisfying this constraint):
Accept dodgy performance (2x slower is not fatal, but no prizes either).
Ask the user to supply a batch-oriented closure, so acting on an iterator over a small batch of data. This exposes a bit much of the internals of the iterator (the data are compressed nicely, and the user needs to know how to unwrap them, or the iterator needs to stage an unwrapped batch in memory).
Make map_edges generic in a type implementing a hypothetical EdgeMapClosure trait, and ask the user to implement such a type for each closure they want to inline. Not tested, but I would guess this exposes distinct methods to LLVM, each of which gets nicely inlined. Downside is that the user has to write their own closure (packing relevant state up, etc).
Horrible hacks, like making distinct methods map_edges0, map_edges1, ... . Or adding a generic parameter the programmer can use to make the methods distinct, but which is otherwise ignored.
Non-solutions include "just use for pair in iterator.iter() { /* */ }"; this is prep work for a data/task-parallel platform, and I would like to be able to capture/move these closures to worker threads rather than capturing the main thread's execution. Maybe the pattern I should be using is to write the above, put it in a lambda/closure, and ship it around instead?
In a perfect world, it would be great to have a pattern which causes each occurrence of map_edges in the source file to result in different specialized methods in the binary, without forcing the entire project to be optimized at some scary level. I'm coming out of an unpleasant relationship with managed languages and JITs where generics would be the only way (I know of) to get this to happen, but Rust and LLVM seem magical enough that I thought there might be a good way. How do Rust's iterators handle this to inline their closure bodies? Or don't they (they should!)?
It seems that the problem is resolved by Rust's new approach to closures outlined at
http://smallcultfollowing.com/babysteps/blog/2014/11/26/purging-proc/
In short, Option 3 above (make functions generic with respect to a new closure type) is now transparently implemented when you make an implementation generic using the new closure traits. Rust produces the type behind the scenes for you.
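A minimal sketch of what that looks like with today's closure traits. EdgeList and multiply are stand-ins invented for this example; the real iterator in the question reads compressed data rather than a Vec.

```rust
/// Hypothetical stand-in for the sparse-matrix edge iterator.
pub struct EdgeList {
    edges: Vec<(usize, usize)>,
}

impl EdgeList {
    pub fn new(edges: Vec<(usize, usize)>) -> Self {
        Self { edges }
    }

    /// Generic over the closure type F. Every closure has its own
    /// anonymous type, so each call site gets its own monomorphized copy
    /// of map_edges, which the optimizer is then free to inline into.
    pub fn map_edges<F: FnMut(usize, usize)>(&self, mut action: F) {
        for &(x, y) in &self.edges {
            action(x, y);
        }
    }
}

/// Sparse matrix-vector multiplication written as in the question.
pub fn multiply(edges: &EdgeList, src: &[f64], dst: &mut [f64]) {
    edges.map_edges(|x, y| dst[y] += src[x]);
}
```

A second call such as edges.map_edges(|x, y| count[y] += 1) uses a different closure type and therefore a different instantiation, so it should not interfere with the first one. Whether the body is actually inlined is still up to the optimizer.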
