When returning a value from a function as such
let x = String::from("Hello");
let y = do_something(x);
with
fn do_something(s: String) -> String { s }
Does Rust do a shallow copy (ie copying the stack value of s into y), or does it do something else ? A shallow copy is made when passing, but is it the same behavior when returning it ?
A shallow copy is made when passing, but is it the same behavior when returning it?
Ignoring the first part (which is more complicated) and answering just what you want to know: yes, both parameter passing and returning have the same mechanism.
Now to the first part: that mechanism is called "moving", but it also matters whether we're talking about the language semantics (i.e. what the language spec describes as happening according to its abstract execution model) or what actually happens at runtime on a physical computer (which is allowed to be different, so long as the program still does the right thing according to the language spec).
In the language semantics, most types (i.e. those which do not implement Copy) are passed as parameters or returned by "moving", not "copying". The difference is that moving implies a change of ownership, whereas copying implies that the original owner retains ownership but the recipient receives ownership of a copy.
In actual reality, a "move" is likely to be implemented as a copy of the stack value (what you call a shallow copy). However, that is not always the case, as the compiler may optimise how it pleases. Particularly if a function is inlined, then there is no need to make any copies of parameters or a return value; the inlined function will be executed in the same stack frame, so the stack value need not be copied to a different frame. Note that Rust's ownership semantics are required for this optimisation to work: the inlined function might change the value it takes ownership of (if the parameter is declared mut), and that change would be visible to the outer function (since it's done in the same stack frame), if not for the fact that the outer function isn't allowed to use the value after giving up ownership of it.
Related
When writing a function how does one decide whether to make input parameters referenced or consumed?
For example, should I do this?
fn foo(val: Bar) -> bool { check(val) } // version 1
Or use referenced param instead?
fn foo(val: &Bar) -> bool { check(*val) } // version 2
On the client side, if I only had the second version but wanted to have my value consumed, I'd have to do something like this:
// given in: Bar
let out = foo(&in); // using version 2 but wanting to consume ownership
drop(in);
On the other hand, if I only had the first version but wanted to keep my reference, I'd have to do something like this:
// given in: &Bar
let out = foo(in.clone()); // using version 1 but wanting to keep reference alive
So which is preferred, and why?
Are there any performance considerations in making this choice? Or does the compiler make them equivalent in terms of performance, and how?
And when would you want to offer both versions (via traits)? And for those times how do you write the underlying implementations for both functions -- do you duplicate the logic in each method signature or do you have one proxy to the other? Which to which, and why?
Rust's goal is to have performance and syntax similar to C/C++ without the memory problems. To do this it avoids things like garbage collection and instead enforces a particular strict memory model of "ownership" and "borrowing". These are critical concepts in Rust. I would suggest reading Understanding Ownership in The Rust Book.
The rules of memory ownership are...
Each value in Rust has a variable that’s called its owner.
There can only be one owner at a time.
When the owner goes out of scope, the value will be dropped.
Enforcing a single owner avoids a great many bugs and complications typical of C and C++ programs while avoiding complex and slow memory management at runtime.
You can't get very far with only that, so Rust provides references. A reference lets functions safely "borrow" data without taking ownership. You can have either as many immutable references as you like, or only one mutable reference.
When applied to function calls, passing a value passes ownership to the function. Passing a reference is "borrowing", ownership is retained.
It's really, really important to understand ownership, borrowing, and later on, lifetimes. But here's some rules of thumb.
If your function needs to take ownership of the data, pass by value.
If your function only needs to read the data, pass a reference.
If your function needs to change the data, pass a mutable reference.
Note what's not in there: performance. Let the compiler take care of that.
Assuming check only reads data and checks that it's ok, it should take a reference. So your example would be...
fn foo(val: &Bar) -> bool { check(val) }
On the client side, if I only had the second version but wanted to have my value consumed...
There's no reason to want a function which takes a reference to do that. If it's the function's job to manage the memory, you pass it ownership. If it isn't, it's not its job to manage your memory.
There's also no need to manually call drop. You'd simply let the variable fall out of scope and it will be automatically dropped.
And when would you want to offer both versions (via traits)?
You wouldn't. If a function can take a reference there's no reason for it to take ownership.
If the function needs ownership, you should pass by value. If the function only needs a reference, you should pass by reference.
Passing by value fn foo(val: Bar) when it isn't necessary for the function to work could require the user to clone the value. Passing by reference is preferred in this case since a clone can be avoided.
Passing by reference fn foo(val: &Bar) when the function needs ownership would require it to either copy or clone the value. Pass by value is preferred in this case because it gives the user control whether an existing value's ownership is transferred or is cloned. The function doesn't have to make that decision and a clone can be avoided.
There are some exceptions, simple primitives like i32 can be passed-by-value without any performance penalty and may be more convenient.
And when would you want to offer both versions (via traits)?
You could use the Borrow trait:
fn foo<B: Borrow<Bar>>(val: B) -> bool {
check(val.borrow())
}
let b: Bar = ...;
foo(&b); // both of
foo(b); // these work
Coming to Rust from dynamic languages like Python, I'm not used to the programming pattern where you provide a function with a mutable reference to an empty data structure and that function populates it. A typical example is reading a file into a String:
let mut f = File::open("file.txt").unwrap();
let mut contents = String::new();
f.read_to_string(&mut contents).unwrap();
To my Python-accustomed eyes, an API where you just create an owned value within the function and move it out as a return value looks much more intuitive / ergonomic / what have you:
let mut f = File::open("file.txt").unwrap();
let contents = f.read_to_string().unwrap();
Since the Rust standard library takes the former road, I figure there must be a reason for that.
Is it always preferable to use the reference pattern? If so, why? (Performance reasons? What specifically?) If not, how do I spot the cases where it might be beneficial? Is it mostly useful when I want to return another value in addition to populating the result data structure (as in the first example above, where .read_to_string() returns the number of bytes read)? Why not use a tuple? Is it simply a matter of personal preference?
If read_to_string wanted to return an owned String, this means it would have to heap allocate a new String every time it was called. Also, because Read implementations don't always know how much data there is to be read, it would probably have to incrementally re-allocate the work-in-progress String multiple times. This also means every temporary String has to go back to the allocator to be destroyed.
This is wasteful. Rust is a system programming language. System programming languages abhor waste.
Instead, the caller is responsible for allocating and providing the buffer. If you only call read_to_string once, nothing changes. If you call it more than once, however, you can re-use the same buffer multiple times without the constant allocate/resize/deallocate cycle. Although it doesn't apply in this specific case, similar interfaces can be design to also support stack buffers, meaning in some cases you can avoid heap activity entirely.
Having the caller pass the buffer in is strictly more flexible than the alternative.
How can you test whether your function is getting [1,2,4,3] or l?
That might be useful to decide whether you want to return, for example, an ordered list or replace it in place.
For example, if it gets [1,2,4,3] it should return [1,2,3,4]. If it gets l, it should link the ordered list to l and do not return anything.
You can't tell the difference in any reasonable way; you could do terrible things with the gc module to count references, but that's not a reasonable way to do things. There is no difference between an anonymous object and a named variable (aside from the reference count), because it will be named no matter what when received by the function; "variables" aren't really a thing, Python has "names" which reference objects, with the object utterly unconcerned with whether it has named or unnamed references.
Make a consistent API. If you need to have it operate both ways, either have it do both things (mutate in place and return the mutated copy for completeness), or make two distinct APIs (one of which can be written in terms of the other, by having the mutating version used to implement the return new version by making a local copy of the argument, passing it to the mutating version, then returning the mutated local copy).
Pass-by-value semantics are easy to implement in an interpreter (for, say, your run-of-the-mill imperative language). For each scope, we maintain an environment that maps identifiers to their values. Processing a function call involves creating a new environment and populating it with copies of the arguments.
This won't work if we allow arguments that are passed by reference. How is this case typically handled?
First, your interpreter must check that the argument is something that can be passed by reference – that the argument is something that is legal in the left-hand side of an assignment statement. For example, if f has a single pass-by-reference parameter, f(x) is okay (since x := y makes sense) but f(1+1) is not (1+1 := y makes no sense). Typical qualifying arguments are variables and variable-like constructs like array indexing (if a is an array for which 5 is a legal index, f(a[5]) is okay, since a[5] = y makes sense).
If the argument passes that check, it will be possible for your interpreter to determine while processing the function call which precise memory location it refers to. When you construct the new environment, you put a reference to that memory location as the value of the pass-by-reference parameter. What that reference concretely looks like depends on the design of your interpreter, particularly on how you represent variables: you could simply use a pointer if your implementation language supports it, but it can be more complex if your design calls for it (the important thing is that the reference must make it possible for you to retrieve and modify the value contained in the memory location being referred to).
while your interpreter is interpreting the body of a function, it may have to treat pass-by-referece parameters specially, since the enviroment does not contain a proper value for it, just a reference. Your interpreter must recognize this and go look what the reference points to. For example, if x is a local variable and y is a pass-by-reference parameter, computing x+1 and y+1 may (depending on the details of your interpreter) work differently: in the former, you just look up the value of x, and then add one to it; in the latter, you must look up the reference that y happens to be bound to in the environment and go look what value is stored in the variable on the far side of the reference, and then you add one to it. Similarly, x = 1 and y = 1 are likely to work differently: the former just goes to modify the value of x, while the latter must first see where the reference points to and modify whatever variable or variable-like thing (such as an array element) it finds there.
You could simplify this by having all variables in the environment be bound to references instead of values; then looking up the value of a variable is the same process as looking up the value of a pass-by-reference parameter. However, this creates other issues, and it depends on your interpreter design and on the details of the language whether that's worth the hassle.
This article seems to imply the possibility that the use of the term "move" in the rust documentation doesn't mean copies, but transfer of ownership at compile time. See this quote specifically:
The compiler enforces that there is only a single owner. Assigning the pointer to a new location transfers ownership (known as a move for short). Consider this program:
Is this correct? are ownership transfers/moves not actually copies at runtime, but only a compile time abstraction.
No, a move is still a copy (in the sense of memcpy) although not necessarily of the whole data structure. However, it has the compile-time semantics that you list. That is,
let a = ~[1,2,3];
let b = a; // copies one word (the pointer), "moves" the data.
let c = b.clone(); // copies the data too.
(Note that I've used b.clone() rather than copy b, because Copy is being removed, and replaced by Clone, which is more powerful/flexible.)
This copying behaviour is required to happen, because (many) variables in Rust are definite chunks of memory (just like those in C/C++), and if something has a certain value, that value has to be in the appropriate location memory; This means moves (which usually involve transferring data from one variable to another) have to actually perform a copy.