Prefer &str over String? Is it always the case?

The Rust documentation suggests using &str whenever possible, and String only when it is not. Is that always the case? For example, I'm building a client for the REST API of a web service and I have an entity:
struct User<'a> {
    id: &'a str,   // or String?
    name: &'a str, // or String?
    // ...
}
So is it better to use &str or String in general and in this particular case?

In Rust, every decision about whether to use a reference stems from the basic concepts of ownership and borrowing and their applications. When you design your data structures, there is no clear-cut rule: it depends wholly on your exact use case.
For example, if your data structure is intended to provide a view into some other data structure (like iterators do), then it makes sense to use references and slices as its fields. If, on the other hand, your structure is a DTO, it is more natural to make it own all of its data.
I believe the suggestion to use &str where possible applies mostly to function signatures, and there it is indeed natural. If your functions accept &str instead of String, callers can use them easily and at no cost whether they hold a String or a &str. If your functions accept String, a caller holding a &str is forced to allocate a new String, and a caller holding a String who doesn't want to give up ownership still has to clone it.
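A minimal sketch of that trade-off. The functions `greeting` and `greeting_owned` are hypothetical names made up for this illustration, not anything from the question:

```rust
// Hypothetical helpers, for illustration only.

/// Borrows its input: callers can pass a `&str` or a `&String` without allocating.
fn greeting(name: &str) -> String {
    format!("Hello, {}!", name)
}

/// Takes ownership: a caller holding a `&str` must allocate, and a caller who
/// wants to keep their `String` must clone it.
fn greeting_owned(name: String) -> String {
    format!("Hello, {}!", name)
}

fn demo_greetings() -> (String, String, String, String) {
    let literal: &str = "Ann";
    let owned: String = String::from("Bob");

    let a = greeting(literal); // &str passed directly
    let b = greeting(&owned); // &String deref-coerces to &str
    let c = greeting_owned(literal.to_string()); // allocation forced on the caller
    let d = greeting_owned(owned.clone()); // clone forced, because `owned` is used below

    let _still_usable = owned.len();
    (a, b, c, d)
}
```

Both versions produce the same result; the difference is only in what the signature demands from the caller.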
But of course there are exceptions: sometimes you do want to transfer ownership into a function. Some data structures and traits, like Option or Read (called Reader before Rust 1.0), can turn an owned value into a borrowed one (Option::as_ref() and Read::by_ref()), which is sometimes useful. There is also the Cow type, which kind of "abstracts" over ownership: it lets you pass a borrowed value that is cloned only if necessary. Sometimes a trait abstracts over various types, owning as well as borrowing, so that the caller can pass values of different types; BytesContainer did this in pre-1.0 Rust, and AsRef<[u8]> plays a similar role today.
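As a rough illustration of two of these tools, here is a sketch using `Cow` and `Option::as_ref` from the standard library. The function names `underscore_spaces` and `name_len` are hypothetical:

```rust
use std::borrow::Cow;

/// Hypothetical helper: returns the input borrowed when nothing has to change,
/// and allocates a new `String` only when a replacement is actually needed.
fn underscore_spaces(input: &str) -> Cow<'_, str> {
    if input.contains(' ') {
        Cow::Owned(input.replace(' ', "_"))
    } else {
        Cow::Borrowed(input)
    }
}

/// `Option::as_ref` turns `&Option<String>` into `Option<&String>`, so the
/// contents can be inspected without taking them out of the caller's option.
fn name_len(name: &Option<String>) -> usize {
    name.as_ref().map(|s| s.len()).unwrap_or(0)
}
```

Callers of `underscore_spaces` can treat the result as a `&str` in both cases and only call `into_owned()` if they really need a `String`.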
What I want to stress, again, is that there is no fixed rule: it depends wholly on the concrete task you're working on. You should use common sense and the ownership/borrowing concepts when you architect your data structures.
In your particular case whether to use String or &str depends on what you will actually do with User objects - just "REST API client" is unfortunately too vague. It depends on your architecture. If these objects are used solely to perform an HTTP request, but the data is actually stored in some other source, then you would likely want to use &strs. If, on the other hand, User objects are used across your entire program, then it makes sense to make them own the data with Strings.
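The two designs could look roughly like this. It is only a sketch: `UserRef`, `as_view` and `to_owned_user` are hypothetical names, and the fields are the two from the question:

```rust
/// Owning variant: suitable when `User` values travel across the whole program.
#[derive(Debug, Clone, PartialEq)]
struct User {
    id: String,
    name: String,
}

/// Borrowing variant (hypothetical): a short-lived view over data stored
/// elsewhere, for example while an HTTP request is being built.
#[derive(Debug, Clone, Copy, PartialEq)]
struct UserRef<'a> {
    id: &'a str,
    name: &'a str,
}

impl User {
    /// Borrow an owned user as a cheap view; no allocation happens here.
    fn as_view(&self) -> UserRef<'_> {
        UserRef { id: &self.id, name: &self.name }
    }
}

impl UserRef<'_> {
    /// Copy the borrowed data into an owned `User` that can outlive the source.
    fn to_owned_user(&self) -> User {
        User { id: self.id.to_string(), name: self.name.to_string() }
    }
}
```

The borrowing variant carries a lifetime parameter into every type and function that stores it, which is often the practical reason to prefer the owning variant for long-lived data.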

Related

Runtime Building: String not found in this scope

A common problem Substrate developers might run into: developing a custom pallet that stores a mapping in storage using common types such as String. As an example:
#[derive(Encode, Decode, Clone, Default, RuntimeDebug)]
pub struct ClusterMetadata {
    ip_address: String,
    namespace: String,
    whitelisted_ips: String,
}
On building the runtime, you get this error for every String:
   |
21 |     ip_address: String,
   |                 ^^^^^^ not found in this scope
Why are Strings not included in scope? And other std rust types?
The error here is not related to no_std, so you probably just need to import the String type to get the real errors with using strings in the runtime.
The real issue you will find is that String is not encodable by Parity SCALE Codec, which is obviously a requirement for any storage item (or most any type you want to use) in the runtime.
So the question is "Why does SCALE not encode String"?
This is by choice. In general, String is a surprisingly complex type. The Rust book spends a whole section talking about the complexities of the type.
As such, Strings can easily become a footgun in the runtime environment when people use them incorrectly.
Furthermore, it is generally bad practice to store Strings in runtime storage. I think we can easily agree that minimizing storage usage in the runtime is a best practice, and thus you should only put into storage items which you need to be able to derive consensus and state transitions in your runtime. Most often, String data would be used for metadata, and this kind of usage is not best practice.
If you look more closely at Substrate, you will find that we break this best practice more than once, but this is a decision we explicitly make, having the information at hand to be able to correctly evaluate the cost/benefit.
All of this combined is why Strings are not treated as a first class object in the runtime. Instead, we ask users to encode strings into bytes, and then work with that byte array instead.
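A plain-Rust sketch of what "work with the byte array" can look like. It deliberately has no Substrate dependencies, so the SCALE derives from the question are left out; the constructor `new` and the accessor `namespace_str` are hypothetical names:

```rust
// Sketch only: in a real pallet this struct would also carry the derives shown
// in the question (Encode, Decode, ...), which are omitted here.
#[derive(Clone, Default, Debug, PartialEq)]
pub struct ClusterMetadata {
    ip_address: Vec<u8>,
    namespace: Vec<u8>,
    whitelisted_ips: Vec<u8>,
}

impl ClusterMetadata {
    /// Hypothetical constructor: text is converted to UTF-8 bytes up front.
    pub fn new(ip_address: &str, namespace: &str, whitelisted_ips: &str) -> Self {
        ClusterMetadata {
            ip_address: ip_address.as_bytes().to_vec(),
            namespace: namespace.as_bytes().to_vec(),
            whitelisted_ips: whitelisted_ips.as_bytes().to_vec(),
        }
    }

    /// Decoding back to text can fail, because arbitrary bytes need not be valid UTF-8.
    pub fn namespace_str(&self) -> Result<&str, core::str::Utf8Error> {
        core::str::from_utf8(&self.namespace)
    }
}
```

Only `core` functions are used for the decoding step, so the same approach should carry over to a no_std environment where `Vec` comes from `alloc` or an equivalent re-export.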

How to decide when function input params should be references or not?

When writing a function how does one decide whether to make input parameters referenced or consumed?
For example, should I do this?
fn foo(val: Bar) -> bool { check(val) } // version 1
Or use referenced param instead?
fn foo(val: &Bar) -> bool { check(*val) } // version 2
On the client side, if I only had the second version but wanted to have my value consumed, I'd have to do something like this:
// given input: Bar (`in` is a reserved keyword, so the variable is named `input`)
let out = foo(&input); // using version 2 but wanting to consume ownership
drop(input);
On the other hand, if I only had the first version but wanted to keep my reference, I'd have to do something like this:
// given input: &Bar (`in` is a reserved keyword, so the variable is named `input`)
let out = foo(input.clone()); // using version 1 but wanting to keep reference alive
So which is preferred, and why?
Are there any performance considerations in making this choice? Or does the compiler make them equivalent in terms of performance, and how?
And when would you want to offer both versions (via traits)? And for those times how do you write the underlying implementations for both functions -- do you duplicate the logic in each method signature or do you have one proxy to the other? Which to which, and why?
Rust's goal is to have performance and syntax similar to C/C++ without the memory problems. To do this it avoids things like garbage collection and instead enforces a particular strict memory model of "ownership" and "borrowing". These are critical concepts in Rust. I would suggest reading Understanding Ownership in The Rust Book.
The rules of memory ownership are...
Each value in Rust has a variable that’s called its owner.
There can only be one owner at a time.
When the owner goes out of scope, the value will be dropped.
Enforcing a single owner avoids a great many bugs and complications typical of C and C++ programs while avoiding complex and slow memory management at runtime.
You can't get very far with only that, so Rust provides references. A reference lets functions safely "borrow" data without taking ownership. You can have either as many immutable references as you like, or only one mutable reference.
When applied to function calls, passing a value passes ownership to the function. Passing a reference is "borrowing"; ownership is retained.
It's really, really important to understand ownership, borrowing, and later on, lifetimes. But here are some rules of thumb.
If your function needs to take ownership of the data, pass by value.
If your function only needs to read the data, pass a reference.
If your function needs to change the data, pass a mutable reference.
Note what's not in there: performance. Let the compiler take care of that.
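The three rules of thumb can be sketched as three signatures. The `Bar` type and the function names here are hypothetical stand-ins, not the ones from the question:

```rust
// Hypothetical type and functions, for illustration only.
#[derive(Debug)]
struct Bar {
    items: Vec<u32>,
}

/// Only reads the data: take a shared reference.
fn total(bar: &Bar) -> u32 {
    bar.items.iter().sum()
}

/// Changes the data in place: take a mutable reference.
fn add_item(bar: &mut Bar, item: u32) {
    bar.items.push(item);
}

/// Needs to keep (here: dismantle) the data: take ownership.
fn into_items(bar: Bar) -> Vec<u32> {
    bar.items
}
```

After `into_items(bar)` the caller can no longer use `bar`, whereas `total(&bar)` and `add_item(&mut bar, ..)` leave it available.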
Assuming check only reads data and checks that it's ok, it should take a reference. So your example would be...
fn foo(val: &Bar) -> bool { check(val) }
On the client side, if I only had the second version but wanted to have my value consumed...
There's no reason to want a function which takes a reference to do that. If it's the function's job to manage the memory, you pass it ownership. If it isn't, it's not its job to manage your memory.
There's also no need to manually call drop. You'd simply let the variable fall out of scope and it will be automatically dropped.
And when would you want to offer both versions (via traits)?
You wouldn't. If a function can take a reference there's no reason for it to take ownership.
If the function needs ownership, you should pass by value. If the function only needs a reference, you should pass by reference.
Passing by value fn foo(val: Bar) when it isn't necessary for the function to work could require the user to clone the value. Passing by reference is preferred in this case since a clone can be avoided.
Passing by reference fn foo(val: &Bar) when the function needs ownership would require it to either copy or clone the value. Pass by value is preferred in this case because it gives the user control whether an existing value's ownership is transferred or is cloned. The function doesn't have to make that decision and a clone can be avoided.
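To make the second case concrete, here is a sketch with a hypothetical `Registry` type that has to store the value it receives:

```rust
/// Hypothetical type that stores names, so it needs to own them.
struct Registry {
    names: Vec<String>,
}

impl Registry {
    /// Takes the `String` by value: the caller decides whether to move or clone.
    fn register(&mut self, name: String) {
        self.names.push(name);
    }

    /// Takes a reference although ownership is needed, so it must always
    /// allocate a copy, even when the caller no longer needs the original.
    fn register_by_ref(&mut self, name: &str) {
        self.names.push(name.to_string());
    }
}
```

With `register`, a caller who is done with the string passes it along for free, and a caller who wants to keep it writes `name.clone()` explicitly. With `register_by_ref` the copy always happens inside the function.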
There are some exceptions: simple primitives like i32 can be passed by value without any performance penalty, and doing so may be more convenient.
And when would you want to offer both versions (via traits)?
You could use the Borrow trait:
use std::borrow::Borrow;

fn foo<B: Borrow<Bar>>(val: B) -> bool {
    check(val.borrow())
}

let b: Bar = ...;
foo(&b); // both of
foo(b);  // these work

When should I use a reference instead of transferring ownership?

From the Rust book's chapter on ownership, non-copyable values can be passed to functions by either transferring ownership or by using a mutable or immutable reference. When you transfer ownership of a value, it can't be used in the original function anymore: you must return it back if you want to. When you pass a reference, you borrow the value and can still use it.
I come from languages where values are immutable by default (Haskell, Idris and the like). As such, I'd probably never think about using references at all. Having the same value in two places looks dangerous (or, at least, awkward) to me. Since references are a feature, there must be a reason to use them.
Are there situations I should force myself to use references? What are those situations and why are they beneficial? Or are they just for convenience and defaulting to passing ownership is fine?
Mutable references in particular look very dangerous.
They are not dangerous, because the Rust compiler will not let you do anything dangerous. If you have a &mut reference to a value then you cannot simultaneously have any other references to it.
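A small sketch of that guarantee. The function names are hypothetical, and the rejected call is left as a comment so that the example still compiles:

```rust
/// Mutates through an exclusive reference.
fn bump_all(values: &mut Vec<i32>) {
    for v in values.iter_mut() {
        *v += 1;
    }
}

fn demo_exclusive() -> Vec<i32> {
    let mut data = vec![1, 2, 3];

    let first = &data[0]; // shared borrow of `data`
    let seen = *first; // last use of `first`, so the shared borrow ends here

    // If `first` were still used after the next line, the compiler would reject
    // the call with error E0502 (cannot borrow `data` as mutable because it is
    // also borrowed as immutable).
    bump_all(&mut data); // fine: no other reference is live

    data.push(seen);
    data
}
```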
In general you should pass references around. This saves copying memory and should be the default thing you do, unless you have a good reason to do otherwise.
Some good reasons to transfer ownership instead:
When the value's type is small, such as bool, u32, etc. Moving or copying such values often performs better because it avoids a level of indirection. These types usually implement Copy, and the compiler may even make this optimisation for you automatically, which it is free to do because of the strong type system and immutability by default!
When the value's current owner is going to go out of scope, you may want to move the value somewhere else to keep it alive.
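The second reason can be sketched as follows, with hypothetical function names. In both functions the value would otherwise be dropped or be unusable, so ownership is moved to wherever it has to live:

```rust
use std::thread;

fn build_report() -> String {
    let report = String::from("done");
    // `report` is a local; returning it moves ownership to the caller instead
    // of dropping the string at the end of this function.
    report
}

fn length_on_worker(text: String) -> usize {
    // `thread::spawn` requires a `'static` closure, so the worker thread has to
    // own `text`; the `move` keyword transfers ownership into the closure.
    let handle = thread::spawn(move || text.len());
    handle.join().expect("worker thread panicked")
}
```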

What's the right way to have a thread-safe lazy-initialized possibly mutable value in Rust?

I have a struct that contains a field that is rather expensive to initialize, so I want to be able to do so lazily. However, this may be necessary in a method that takes &self. The field also needs to be able to be modified once it is initialized, but this will only occur in methods that take &mut self.
What is the correct (as in idiomatic, as well as in thread-safe) way to do this in Rust? It seems to me that it would be trivial with either of the two constraints:
If it only needed to be lazily initialized, and not mutated, I could simply use lazy-init's Lazy<T> type.
If it only needed to be mutable and not lazy, then I could just use a normal field (obviously).
However, I'm not quite sure what to do with both in place. RwLock seems relevant, but it appears that there is considerable trickiness to thread-safe lazy initialization given what I've seen of lazy-init's source, so I am hesitant to roll my own solution based on it.
The simplest solution is RwLock<Option<T>>.
"However, I'm not quite sure what to do with both in place. RwLock seems relevant, but it appears that there is considerable trickiness to thread-safe lazy initialization given what I've seen of lazy-init's source, so I am hesitant to roll my own solution based on it."
lazy-init uses tricky code because it guarantees lock-free access after creation. Lock-free is always a bit trickier.
Note that in Rust it's easy to tell whether something is tricky or not: tricky means using an unsafe block. Since you can use RwLock<Option<T>> without any unsafe block there is nothing for you to worry about.
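A minimal sketch of the RwLock<Option<T>> approach, using only the standard library. `Holder`, `expensive_init` and the `Vec<u64>` payload are hypothetical placeholders for the struct and the expensive field from the question:

```rust
use std::sync::RwLock;

/// Hypothetical container; `Vec<u64>` stands in for the expensive field.
struct Holder {
    cache: RwLock<Option<Vec<u64>>>,
}

/// Hypothetical stand-in for the expensive initialization.
fn expensive_init() -> Vec<u64> {
    (1..=4).collect()
}

impl Holder {
    fn new() -> Self {
        Holder { cache: RwLock::new(None) }
    }

    /// Works through `&self`: initializes lazily, then reads.
    fn sum(&self) -> u64 {
        {
            // Fast path: already initialized, a shared read lock is enough.
            let guard = self.cache.read().unwrap();
            if let Some(values) = guard.as_ref() {
                return values.iter().sum();
            }
        } // the read guard is dropped here, before the write lock is taken

        // Slow path: another thread may have initialized the value in the
        // meantime, so the initializer only runs if the option is still empty.
        let mut guard = self.cache.write().unwrap();
        let total = guard.get_or_insert_with(expensive_init).iter().sum();
        total
    }

    /// With `&mut self` no locking is needed: `get_mut` hands out the inner value.
    fn push(&mut self, value: u64) {
        self.cache
            .get_mut()
            .unwrap()
            .get_or_insert_with(expensive_init)
            .push(value);
    }
}
```

Unlike lazy-init, every read here goes through the lock, and handing out a reference to the inner value from `&self` would mean returning a guard rather than a plain `&T`.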
A variant of RwLock<Option<T>> may be necessary if you want to capture a closure for initialization once, rather than having to pass it at each potential initialization call site.
In this case, you'll need something like RwLock<SimpleLazy<T>> where:
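enum SimpleLazy<T> {
    Initialized(T),
    Uninitialized(Box<dyn FnOnce() -> T + Send + Sync>),
}
You don't have to implement Sync for SimpleLazy<T> by hand: RwLock<SimpleLazy<T>> is Sync whenever SimpleLazy<T> is Send + Sync, which is why the boxed closure carries Send + Sync bounds (T itself must be Send + Sync as well).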
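One way this could be wired up, as a sketch rather than a definitive implementation. The enum is written with the `dyn` keyword and `Send + Sync` bounds on the closure, which current Rust needs for the RwLock to be shareable between threads. The wrapper `LazyField`, its methods, the `Init` alias and the placeholder trick used to move the closure out are all assumptions of this example:

```rust
use std::mem;
use std::sync::RwLock;

type Init<T> = Box<dyn FnOnce() -> T + Send + Sync>;

enum SimpleLazy<T> {
    Initialized(T),
    Uninitialized(Init<T>),
}

/// Hypothetical wrapper that captures the initializer once, at construction.
struct LazyField<T> {
    state: RwLock<SimpleLazy<T>>,
}

impl<T: 'static> LazyField<T> {
    fn new(init: impl FnOnce() -> T + Send + Sync + 'static) -> Self {
        LazyField {
            state: RwLock::new(SimpleLazy::Uninitialized(Box::new(init))),
        }
    }

    /// Runs `f` on the value, initializing it first if necessary.
    fn with<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        {
            let guard = self.state.read().unwrap();
            if let SimpleLazy::Initialized(value) = &*guard {
                return f(value);
            }
        } // read guard dropped before taking the write lock

        let mut guard = self.state.write().unwrap();
        if matches!(*guard, SimpleLazy::Uninitialized(_)) {
            // An `FnOnce` must be called by value, so the closure is swapped
            // out for a placeholder that is never invoked.
            let placeholder: Init<T> =
                Box::new(|| -> T { unreachable!("placeholder initializer") });
            let old = mem::replace(&mut *guard, SimpleLazy::Uninitialized(placeholder));
            if let SimpleLazy::Uninitialized(init) = old {
                *guard = SimpleLazy::Initialized(init());
            }
        }
        match &*guard {
            SimpleLazy::Initialized(value) => f(value),
            SimpleLazy::Uninitialized(_) => unreachable!("initialized above"),
        }
    }
}
```

If the initializer panics, the RwLock is poisoned and later calls panic on `unwrap()`, so the placeholder closure is never actually called. An alternative that avoids the placeholder is to store `Option<Init<T>>` in the `Uninitialized` variant and `take()` it.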

Multiple specialization, iterator patterns in Rust

Learning Rust (yay!) and I'm trying to understand the intended idiomatic programming required for certain iterator patterns, while scoring top performance. Note: not Rust's Iterator trait, just a method I've written accepting a closure and applying it to some data I'm pulling off of disk / out of memory.
I was delighted to see that Rust (+LLVM?) took an iterator I had written for sparse matrix entries, and a closure for doing sparse matrix vector multiplication, written as
iterator.map_edges({ |x, y| dst[y] += src[x] });
and inlined the closure's body in the generated code. It went quite fast. :D
If I create two of these iterators, or use the first a second time (not a correctness issue) each instance slows down quite a lot (about 2x in this case), presumably because the optimizer no longer chooses to do specialization because of the multiple call sites, and you end up doing a function call for each element.
I'm trying to understand if there are idiomatic patterns that keep the pleasant experience above (I like it, at least) without sacrificing the performance. My options seem to be (none satisfying this constraint):
Accept dodgy performance (2x slower is not fatal, but no prizes either).
Ask the user to supply a batch-oriented closure, so acting on an iterator over a small batch of data. This exposes a bit much of the internals of the iterator (the data are compressed nicely, and the user needs to know how to unwrap them, or the iterator needs to stage an unwrapped batch in memory).
Make map_edges generic in a type implementing a hypothetical EdgeMapClosure trait, and ask the user to implement such a type for each closure they want to inline. Not tested, but I would guess this exposes distinct methods to LLVM, each of which gets nicely inlined. The downside is that the user has to write their own closure (packing relevant state up, etc).
Horrible hacks, like making distinct methods map_edges0, map_edges1, ... . Or adding a generic parameter the programmer can use to make the methods distinct, but which is otherwise ignored.
Non-solutions include "just use for pair in iterator.iter() { /* */ }"; this is prep work for a data/task-parallel platform, and I would like to be able to capture/move these closures to work threads rather than capturing the main thread's execution. Maybe the pattern I should be using is to write the above, put it in a lambda/closure, and ship it around instead?
In a perfect world, it would be great to have a pattern which causes each occurrence of map_edges in the source file to result in different specialized methods in the binary, without forcing the entire project to be optimized at some scary level. I'm coming out of an unpleasant relationship with managed languages and JITs where generics would be the only way (I know of) to get this to happen, but Rust and LLVM seem magical enough that I thought there might be a good way. How do Rust's iterators handle this to inline their closure bodies? Or don't they (they should!)?
It seems that the problem is resolved by Rust's new approach to closures outlined at
http://smallcultfollowing.com/babysteps/blog/2014/11/26/purging-proc/
In short, Option 3 above (make functions generic with respect to a new closure type) is now transparently implemented when you make an implementation generic using the new closure traits. Rust produces the type behind the scenes for you.
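In current Rust that looks roughly like the sketch below. `EdgeList` is a hypothetical stand-in for the sparse-matrix iterator from the question; the relevant part is the generic parameter `F` bounded by a closure trait:

```rust
/// Hypothetical stand-in for the question's edge iterator.
struct EdgeList {
    edges: Vec<(usize, usize)>,
}

impl EdgeList {
    /// Generic over the closure type `F`. Every closure has its own anonymous
    /// type, so the compiler generates a separate copy of `map_edges` for each
    /// closure passed in, and LLVM can inline the closure body into that copy.
    fn map_edges<F: FnMut(usize, usize)>(&self, mut f: F) {
        for &(x, y) in &self.edges {
            f(x, y);
        }
    }
}
```

A call such as `graph.map_edges(|x, y| dst[y] += src[x])` then gets its own specialized instance. Whether the body is actually inlined is still up to the optimizer, so it is worth measuring; taking `&mut dyn FnMut(usize, usize)` instead would force a dynamic call per element.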

Resources