What costs are incurred when using Cell<T> as opposed to just T? - rust

I ran across a comment on reddit that indicates that using Cell<T> prevents certain optimizations from occurring:
Cell works with no memory overhead (Cell is the same size as T) and little runtime overhead (it "just" inhibits optimisations, it doesn't introduce extra explicit operations)
This seems counter to other things I've read about Cell<T>, in particular that it's "zero-cost." The first place I encountered this categorization is here.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.

TL;DR Cell is a Zero-Overhead Abstraction; that is, the same functionality implemented manually has the same cost.
The term Zero-Cost Abstractions is not English, it's jargon. The idea of Zero-Cost Abstractions is that the layer of abstraction itself does not add any cost compared to manually doing the same thing.
There are various misunderstandings that have sprung up: most notably, I have regularly seen zero-cost understood as "the operation is free", which is not the case.
To add to the confusion, the exception mechanism used by most C++ implementations, which Rust also uses for panic = unwind, is called Zero-Cost Exceptions, and purports [1] to add no overhead on the non-throwing path. It's a different kind of Zero-Cost...
Lately, my recommendation is to switch to using the term Zero-Overhead Abstractions: first because it's a distinct term from Zero-Cost Exceptions, so less likely to be mistaken, and second because it emphasizes that the Abstraction does not add Overhead, which is what we are trying to convey in the first place.
[1] The objective is only partially achieved. While the same assembly executed with and without the possibility of throwing indeed has the same performance, the presence of potential exceptions may hinder the optimizer and cause it to generate sub-optimal assembly in the first place.
With all that said, I'd like to understand the actual cost of using Cell<T>, including whatever optimizations it may prevent.
On the memory side, there is no overhead:
size_of::<Cell<T>>() == size_of::<T>(),
given a cell of type Cell<T>, the address of cell (&cell viewed as a raw pointer) equals cell.as_ptr().
(You can peek at the source code)
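As a quick check of the two memory claims above, here is a minimal sketch. The helper names same_size and same_address are hypothetical, and u64 is just an arbitrary payload type chosen for illustration.

```rust
use std::cell::Cell;
use std::mem::size_of;

/// Returns true when `Cell<u64>` occupies exactly as many bytes as `u64`.
fn same_size() -> bool {
    size_of::<Cell<u64>>() == size_of::<u64>()
}

/// Returns true when the cell's own address equals the address of its contents.
fn same_address(cell: &Cell<u64>) -> bool {
    let outer = cell as *const Cell<u64> as usize;
    let inner = cell.as_ptr() as usize;
    outer == inner
}

fn main() {
    let cell = Cell::new(42u64);
    assert!(same_size());
    assert!(same_address(&cell));
}
```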
On the access side, Cell<T> does incur a run-time cost compared to T; the cost of the extra functionality.
The most immediate cost is that manipulating the value through a &Cell<T> requires copying it back and forth [1]. This is a bitwise copy, so the optimizer may elide it if it can prove that doing so is safe.
Another notable cost is that UnsafeCell<T>, on which Cell<T> is based, breaks the rule that &T means that T cannot be modified.
When a compiler can prove that a portion of memory cannot be modified, it can optimize out further reads: read t.foo into a register, then use the register value rather than reading t.foo again.
In traditional Rust code, a &T gives such a guarantee: no matter whether there are opaque function calls, calls to C code, etc. between two reads of t.foo, the second read will return the same value as the first, guaranteed. With a &Cell<T>, there is no such guarantee any longer, and thus unless the optimizer can prove beyond doubt that the value is unmodified [2], it cannot apply such optimizations.
[1] You can manipulate the value at no cost through &mut Cell<T> or using unsafe code.
[2] For example, if the optimizer knows that the value resides on the stack, and it never passed the address of the value to anyone else, then it can reasonably conclude that no one else can modify the value. Although a stack-smashing attack may, of course.
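To make the aliasing point concrete, here is a small sketch. The function names are hypothetical and the `opaque` callback stands in for any call the optimizer cannot see through. With a plain &u32 the second read may legally reuse the first; with a &Cell<u32> the callback can change the value in between, so the second read has to be performed. The last helper shows in-place access through &mut Cell<T>.

```rust
use std::cell::Cell;

/// Reads through a shared reference twice, with an opaque call in between.
/// Because `x` is `&u32`, the value cannot change, so the compiler is
/// allowed to reuse the first read.
fn read_twice_plain(x: &u32, opaque: &dyn Fn()) -> u32 {
    let first = *x;
    opaque();
    let second = *x;
    first + second
}

/// Same shape, but through `&Cell<u32>`: `opaque` may call `set`, so the
/// second `get` cannot be assumed to equal the first.
fn read_twice_cell(x: &Cell<u32>, opaque: &dyn Fn()) -> u32 {
    let first = x.get();
    opaque();
    let second = x.get();
    first + second
}

/// With exclusive access, the contents can be updated in place via `get_mut`.
fn bump_in_place(x: &mut Cell<u32>) {
    *x.get_mut() += 1;
}

fn main() {
    let plain = 1u32;
    assert_eq!(read_twice_plain(&plain, &|| {}), 2);

    let cell = Cell::new(1u32);
    // The callback mutates the cell through a shared reference.
    let total = read_twice_cell(&cell, &|| cell.set(10));
    assert_eq!(total, 11);

    let mut owned = Cell::new(1u32);
    bump_in_place(&mut owned);
    assert_eq!(owned.get(), 2);
}
```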

Related

To what extent is Rust shadowing zero-cost?

A Zero-Runtime-Cost Mixed List in Rust outlines how to create a heterogeneous list in Rust using tuples and normal traits (not trait objects like this question suggests). The list seems to rely heavily on shadowing and effectively changes the entire type of the list every time a new element is added.
The implementation seems brilliant to me, but after reviewing Rust's homepage and a few other resources I could not find any place that explicitly defines shadowing as zero-cost. As far as I know, repeatedly abandoning data on the stack is less costly than indirection, but repeatedly copying and adding to existing data instead of mutating it sounds pretty expensive.
What you don’t use, you don’t pay for. And further: What you do use, you couldn’t hand code any better.
Bjarne Stroustrup
Shadowing seems to fulfill the first requirement, but the second?
Is Rust's shadowing actually zero-cost?
Official Rust material tries very hard to never talk about "zero cost" by itself, so you'll have to cite where you see "zero cost" without further qualification. The article states zero runtime cost, so the author of the post is aware of that. In most cases, "zero cost" is used in the context of zero-cost abstractions.
Your Stroustrup quote only partially and obliquely deals with zero-cost abstractions. A better explanation, emphasis mine:
It means paying no penalty for the abstraction, or said otherwise, it means that whether you use the abstraction or instead go for the "manual" implementation you end up having the same costs (same speed, same memory consumption, ...).
Matthieu M.
This means that any time you see "zero-cost abstraction", you have to have something to compare the abstraction against; only then can you tell if it's truly zero-cost.
I don't think that shadowing even counts as an abstraction, but let's pretend it does (and I'll word the rest of my answer as if I believe it is).
Shadowing a variable means having multiple distinct variables with the same name, with the later ones precluding access to the previous ones. The non-"abstract" version of that is having multiple distinct variables of different names. I'd say that having two variables of the same name is the same cost as having two variables of different names, so it is a zero-cost abstraction.
See also:
Why do I need rebinding/shadowing when I can have mutable variable binding?
In Rust, what's the difference between "shadowing" and "mutability"?
Playing the game further, you can ask "is having two variables a zero-cost abstraction?". I'd say that this depends on what the variables are and how they relate.
In this example, I'd say that this is a zero-cost abstraction as there's no more efficient way I could have written the code.
fn example() {
    let a = String::new();
    let a = a;
}
On the other hand, I'd say that this is not a zero-cost abstraction, as the first a will not be deallocated until the end of the function:
fn example() {
    let a = String::new();
    let a = String::new();
}
A better way I could choose to write it would be to call drop in the middle. There are good reasons that Rust doesn't do this, but it's not as efficient in regards to memory usage as a hand-written implementation could be.
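The difference can be observed with a type that records when it is dropped. This is only an illustrative sketch: Noisy, Log, shadow_only and shadow_with_drop are hypothetical names, and the log stands in for "memory being released".

```rust
use std::cell::RefCell;

type Log = RefCell<Vec<&'static str>>;

/// Records its name in the shared log when it is dropped.
struct Noisy<'a> {
    name: &'static str,
    log: &'a Log,
}

impl Drop for Noisy<'_> {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Shadowing alone: the first value stays alive until the end of the scope.
fn shadow_only(log: &Log) {
    let _a = Noisy { name: "first", log };
    let _a = Noisy { name: "second", log };
    log.borrow_mut().push("end of body");
}

/// Explicit `drop` before shadowing: the first value is released early.
fn shadow_with_drop(log: &Log) {
    let a = Noisy { name: "first", log };
    drop(a);
    let _a = Noisy { name: "second", log };
    log.borrow_mut().push("end of body");
}

fn main() {
    let log = Log::new(Vec::new());

    shadow_only(&log);
    assert_eq!(*log.borrow(), ["end of body", "second", "first"]);

    log.borrow_mut().clear();
    shadow_with_drop(&log);
    assert_eq!(*log.borrow(), ["first", "end of body", "second"]);
}
```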
See also:
Is it possible in Rust to delete an object before the end of scope?
Does Rust free up the memory of overwritten variables?
Is the resource of a shadowed variable binding freed immediately?

Is `u32`/`i32` suggested even on limited range number case?

Should we use u32/i32 or one of the smaller variants (u8/i8, u16/i16) when dealing with a number of limited range, like "days in month", which ranges from 1 to 30, or "score of a subject", which ranges from 0 to 100? And if not, why not?
Is there any optimization or benefit on the lower variant (i.e. memory efficient)?
Summary
Correctness should be prioritized over performance and correctness-wise (for ranges like 1–100), all solutions (u8, u32, ...) are equally bad. The best solution would be to create a new type to benefit from strong typing.
The rest of my answer tries to justify this claim and discusses different ways of creating the new type.
More explanation
Let's take a look at the "score of subject" example: the only legal values are 0–100. I'd argue that correctness-wise, using u8 and u32 is equally bad: in both cases, your variable can hold values that are not legal in your semantic context; that's bad!
And arguing that the u8 is better, because there are fewer illegal values, is like arguing that wrestling a bear is better than walking through New York, because you only have one possibility of dying (blood loss by bear attack) as opposed to the many possibilities of death (car accident, knife attack, drowning, ...) in New York.
So what we want is a type that guarantees to hold only legal values. We want to create a new type that does exactly this. However, there are multiple ways to proceed; each with different advantages and disadvantages.
(A) Make the inner value public
struct ScoreOfSubject(pub u8);
Advantage: at least APIs are easier to understand, because the parameter is already explained by the type. What is easier to understand:
add_record("peter", 75, 47) or
add_record("peter", StudentId(75), ScoreOfSubject(47))?
I'd say the latter one ;-)
Disadvantage: we don't actually do any range checking and illegal values can still occur; bad!
(B) Make inner value private and supply a range checking constructor
// The inner field is private here, so values can only be created through `new`.
struct ScoreOfSubject(u8);
impl ScoreOfSubject {
    pub fn new(value: u8) -> Self {
        assert!(value <= 100);
        ScoreOfSubject(value)
    }
    pub fn get(&self) -> u8 { self.0 }
}
Advantage: we enforce legal values with very little code, yeah :)
Disadvantage: working with the type can be annoying. Pretty much every operation requires the programmer to pack & unpack the value.
(C) Add a bunch of implementations (in addition to (B))
(the code would impl Add<_>, impl Display and so on)
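A minimal sketch of what such implementations might look like, building on (B). The choice to panic on out-of-range sums and the "value/100" display format are assumptions made for illustration, not something prescribed above.

```rust
use std::fmt;
use std::ops::Add;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct ScoreOfSubject(u8);

impl ScoreOfSubject {
    const MAX: u8 = 100;

    /// Range-checking constructor, as in (B).
    pub fn new(value: u8) -> Self {
        assert!(value <= Self::MAX, "score out of range: {}", value);
        ScoreOfSubject(value)
    }

    pub fn get(&self) -> u8 {
        self.0
    }
}

impl Add for ScoreOfSubject {
    type Output = ScoreOfSubject;

    /// Adds two scores; the result goes through `new`, so it is range-checked.
    fn add(self, rhs: Self) -> Self {
        // Both operands are at most 100, so the sum is at most 200 and fits in a u8.
        ScoreOfSubject::new(self.0 + rhs.0)
    }
}

impl fmt::Display for ScoreOfSubject {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}/{}", self.0, Self::MAX)
    }
}

fn main() {
    let total = ScoreOfSubject::new(40) + ScoreOfSubject::new(7);
    assert_eq!(total.get(), 47);
    assert_eq!(total.to_string(), "47/100");
}
```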
Advantage: the programmer can use the type and do all useful operations on it directly -- with range checking! This is pretty optimal.
Please take a look at Matthieu M.'s comment:
[...] generally multiplying scores together, or dividing them, does not produce a score! Strong typing not only enforces valid values, it also enforces valid operations, so that you don't actually divide two scores together to get another score.
I think this is a very important point I failed to make clear before. Strong typing prevents the programmer from executing illegal operations on values (operations that don't make any sense). A good example is the crate cgmath which distinguishes between point and direction vectors, because both support different operations on them. You can find additional explanation here.
Disadvantage: a lot of code :(
Luckily the disadvantage can be reduced by the Rust macro/compiler plugin system. There are crates like newtype_derive or bounded_integer that do this kind of code generation for you (disclaimer: I never worked with them).
But now you say: "you can't be serious? Am I supposed to spend my time writing new types?".
Not necessarily, but if you are working on production code (== at least somewhat important), then my answer is: yes, you should.
A no-answer answer: I doubt you would see any difference in benchmarks, unless you do A LOT of arithmetic or process HUGE arrays of numbers.
You should probably just go with the type which makes more sense (no reason to use negatives or have an upper bound in millions for a day of month) and provides the methods you need (e.g. you can't perform abs() directly on an unsigned integer).
There could be major benefits using smaller types but you would have to benchmark your application on your target platform to be sure.
The first and most easily realized benefit from the lower memory footprint is better caching. Not only is your data more likely to fit into the cache, but it is also less likely to discard other data in the cache, potentially improving a completely different part of your application. Whether or not this is triggered depends on what memory your application touches and in what order. Do the benchmarks!
Network data transfers have an obvious benefit from using smaller types.
Smaller data allows "larger" instructions. A 128-bit SIMD unit can handle four 32-bit values or sixteen 8-bit values at once, making certain operations 4 times faster. In benchmarks I've made, these instructions did indeed execute 4 times faster, BUT the whole application improved by less than 1%, and the code became more of a mess. Shaping your program to make better use of SIMD can be tricky.
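The footprint arithmetic behind these points can be checked directly. This is only a sketch of the sizes involved (the helper lanes_per_128_bits is a hypothetical name); it says nothing about actual run-time speed, which still has to be benchmarked.

```rust
use std::mem::size_of;

/// Number of values of type `T` that fit into one 128-bit (16-byte) register.
fn lanes_per_128_bits<T>() -> usize {
    16 / size_of::<T>()
}

fn main() {
    // 1000 scores stored as u8 take a quarter of the space of 1000 u32 values.
    assert_eq!(size_of::<[u8; 1000]>(), 1000);
    assert_eq!(size_of::<[u32; 1000]>(), 4000);

    // Lane counts for a 128-bit SIMD unit.
    assert_eq!(lanes_per_128_bits::<u32>(), 4);
    assert_eq!(lanes_per_128_bits::<u8>(), 16);
}
```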
As for the signed/unsigned discussion: unsigned has slightly better properties, which a compiler may or may not take advantage of.

How do Rust's ownership semantics relate to uniqueness typing as found in Clean and Mercury?

I noticed that in Rust moving is applied to lvalues, and it's statically enforced that moved-from objects are not used.
How do these semantics relate to uniqueness typing as found in Clean and Mercury? Are they the same concept? If not, how do they differ?
The concept of ownership in Rust is not the same as uniqueness in Mercury and Clean, although they are related in that they both aim to provide safety via static checking, and they are both defined in terms of the number of references within a scope. The key differences are:
Uniqueness is a more abstract concept. While it can be interpreted as saying that a reference to a memory location is unique, like Rust's lvalues, it can also apply to abstract values such as the state of every object in the universe, to give an extreme but typical example. There is no pointer corresponding to such a value - it cannot be opened up and inspected within a debugger or anything like that - but it can be used through an interface just like any other abstract type. The aim is to give a value-oriented semantics that remains consistent in the presence of statefulness.
In Mercury, at least (I can't speak for Clean), uniqueness is a more limited concept than ownership, in that there must be exactly one reference. You can't share several copies of a reference on the proviso that they will not be written to, as can be done in Rust. You also can't lend a reference for writing but get it back later after the borrower has finished with it.
Declaring something unique in Mercury does not guarantee that writing to references will occur, just that the compiler will check that it would be safe to do so; it is still valid for an implementation to copy the contents of a unique reference rather than update in place. The compiler will arrange for the update in place if it deems it appropriate at its given optimization level. Alternatively, authors of abstract types may perform similar (or sometimes drastically better) optimizations manually, safe in the knowledge that users will be forced to use the abstract type in a way that is consistent with them. Ownership in Rust, on the other hand, is more directly connected to the memory model and gives stronger guarantees about behaviour.
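For the Rust side of this comparison, here is a small sketch of the two capabilities mentioned above that go beyond strict uniqueness: sharing several read-only references, and lending a reference for writing and getting it back. The function names are hypothetical.

```rust
/// Takes a shared (read-only) borrow.
fn total(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

/// Takes an exclusive (writable) borrow for the duration of the call.
fn push_one(xs: &mut Vec<i32>) {
    xs.push(1);
}

fn main() {
    let mut v = vec![1, 2, 3];

    // Several shared references may coexist as long as nobody writes.
    let a = &v;
    let b = &v;
    assert_eq!(total(a) + total(b), 12);

    // Lend the vector for writing; ownership comes back after the call.
    push_one(&mut v);
    assert_eq!(v.len(), 4);

    // Moving transfers ownership; using `v` after this line would not compile.
    let moved = v;
    assert_eq!(total(&moved), 7);
}
```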

Multiple specialization, iterator patterns in Rust

Learning Rust (yay!) and I'm trying to understand the intended idiomatic programming required for certain iterator patterns, while scoring top performance. Note: not Rust's Iterator trait, just a method I've written accepting a closure and applying it to some data I'm pulling off of disk / out of memory.
I was delighted to see that Rust (+LLVM?) took an iterator I had written for sparse matrix entries, and a closure for doing sparse matrix vector multiplication, written as
iterator.map_edges({ |x, y| dst[y] += src[x] });
and inlined the closure's body in the generated code. It went quite fast. :D
If I create two of these iterators, or use the first a second time (not a correctness issue) each instance slows down quite a lot (about 2x in this case), presumably because the optimizer no longer chooses to do specialization because of the multiple call sites, and you end up doing a function call for each element.
I'm trying to understand if there are idiomatic patterns that keep the pleasant experience above (I like it, at least) without sacrificing the performance. My options seem to be (none satisfying this constraint):
Accept dodgy performance (2x slower is not fatal, but no prizes either).
Ask the user to supply a batch-oriented closure, so acting on an iterator over a small batch of data. This exposes a bit much of the internals of the iterator (the data are compressed nicely, and the user needs to know how to unwrap them, or the iterator needs to stage an unwrapped batch in memory).
Make map_edges generic in a type implementing a hypothetical EdgeMapClosure trait, and ask the user to implement such a type for each closure they want to inline. Not tested, but I would guess this exposes distinct methods to LLVM, each of which gets nicely inlined. The downside is that the user has to write their own closure (packing relevant state up, etc).
Horrible hacks, like make distinct methods map_edges0, map_edges1, ... . Or add a generic parameter the programmer can use to make the methods distinct, but which is otherwise ignored.
Non-solutions include "just use for pair in iterator.iter() { /* */ }"; this is prep work for a data/task-parallel platform, and I would like to be able to capture/move these closures to work threads rather than capturing the main thread's execution. Maybe the pattern I should be using is to write the above, put it in a lambda/closure, and ship it around instead?
In a perfect world, it would be great to have a pattern which causes each occurrence of map_edges in the source file to result in different specialized methods in the binary, without forcing the entire project to be optimized at some scary level. I'm coming out of an unpleasant relationship with managed languages and JITs where generics would be the only way (I know of) to get this to happen, but Rust and LLVM seem magical enough that I thought there might be a good way. How do Rust's iterators handle this to inline their closure bodies? Or don't they (they should!)?
It seems that the problem is resolved by Rust's new approach to closures outlined at
http://smallcultfollowing.com/babysteps/blog/2014/11/26/purging-proc/
In short, Option 3 above (make functions generic with respect to a new closure type) is now transparently implemented when you make an implementation generic using the new closure traits. Rust produces the type behind the scenes for you.
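A minimal sketch of that pattern, assuming a hypothetical Edges type that stores (x, y) pairs in a Vec (the real iterator in the question reads compressed data). Because map_edges is generic over the closure type F, each call site with a different closure gets its own monomorphized copy, which the optimizer is then free to inline.

```rust
/// Hypothetical stand-in for the sparse-matrix edge iterator.
struct Edges {
    pairs: Vec<(usize, usize)>,
}

impl Edges {
    /// Generic over the closure type: every distinct closure produces a
    /// separate instantiation of this method.
    fn map_edges<F>(&self, mut f: F)
    where
        F: FnMut(usize, usize),
    {
        for &(x, y) in &self.pairs {
            f(x, y);
        }
    }
}

fn main() {
    let edges = Edges { pairs: vec![(0, 1), (1, 2), (2, 0)] };

    // First call site: sparse matrix-vector style accumulation.
    let src = [1.0, 2.0, 3.0];
    let mut dst = [0.0f64; 3];
    edges.map_edges(|x, y| dst[y] += src[x]);
    assert_eq!(dst, [3.0, 1.0, 2.0]);

    // Second call site with a different closure type.
    let mut count = 0;
    edges.map_edges(|_, _| count += 1);
    assert_eq!(count, 3);
}
```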

Suitable Haskell type for large, frequently changing sequence of floats

I have to pick a type for a sequence of floats with 16K elements. The values will be updated frequently, potentially many times a second.
I've read the wiki page on arrays. Here are the conclusions I've drawn so far. (Please correct me if any of them are mistaken.)
IArrays would be unacceptably slow in this case, because they'd be copied on every change. With 16K floats in the array, that's 64KB of memory copied each time.
IOArrays could do the trick, as they can be modified without copying all the data. In my particular use case, doing all updates in the IO monad isn't a problem at all. But they're boxed, which means extra overhead, and that could add up with 16K elements.
IOUArrays seem like the perfect fit. Like IOArrays, they don't require a full copy on each change. But unlike IOArrays, they're unboxed, meaning they're basically the Haskell equivalent of a C array of floats. I realize they're strict. But I don't see that being an issue, because my application would never need to access anything less than the entire array.
Am I right to look to IOUArrays for this?
Also, suppose I later want to read or write the array from multiple threads. Will I have backed myself into a corner with IOUArrays? Or is the choice of IOUArrays totally orthogonal to the problem of concurrency? (I'm not yet familiar with the concurrency primitives in Haskell and how they interact with the IO monad.)
A good rule of thumb is that you should almost always use the vector library instead of arrays. In this case, you can use mutable vectors from the Data.Vector.Mutable module.
The key operations you'll want are read and write which let you mutably read from and write to the mutable vector.
You'll want to benchmark of course (with criterion) or you might be interested in browsing some benchmarks I did e.g. here (if that link works for you; broken for me).
The vector library is a nice interface (crazy understatement) over GHC's more primitive array types, which you can get to more directly in the primitive package. The types in the standard array package are built on the same primitives; for instance, an IOUArray is essentially a MutableByteArray#.
Unboxed mutable arrays are usually going to be the fastest, but you should compare them in your application to IOArray or the vector equivalent.
My advice would be:
if you probably don't need concurrency first try a mutable unboxed Vector as Gabriel suggests
if you know you will want concurrent updates (and feel a little brave) then first try a MutableArray and then do atomic updates with these functions from the atomic-primops library. If you want fine-grained locking, this is your best choice. Of course concurrent reads will work fine on whatever array you choose.
It should also be theoretically possible to do concurrent updates on a MutableByteArray (equivalent to IOUArray) with those atomic-primops functions too, since a Float should always fit into a word (I think), but you'd have to do some research (or bug Ryan).
Also be aware of potential memory reordering issues when doing concurrency with the atomic-primops stuff, and help convince yourself with lots of tests; this is somewhat uncharted territory.
