What is the difference between [u8] and Vec<u8> in Rust?
[u8] represents an unsized contiguous sequence of u8 somewhere in memory. As an "unsized" type, it can't be stored directly in a variable or passed to a function by value, so it's not very useful on its own. Its primary use is behind slice references and smart pointers, and in generic types.
&[u8] is a "slice reference" which refers to such a sequence, and also carries information about its length. The reference is represented by a "fat pointer" two machine words wide, consisting of pointer to the data and the length of the data. It's the basis for &str.
Box<[u8]> is like &[u8], except it owns the underlying memory, i.e. the sequence is heap-allocated (for example by Vec::into_boxed_slice(), or by Box::new() on an array followed by an unsized coercion) and deallocated when the box is dropped. It is also two machine words wide. It's the basis for Box<str>.
Vec<u8> is like Box<[u8]>, except it additionally stores a "capacity" count, making it three machine words wide. Separately stored capacity allows for efficient resizing of the underlying sequence. It's the basis for String.
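The widths described above can be checked with std::mem::size_of. This is only a sketch: the helper name handle_sizes_in_words is made up for illustration, and the three-word size of Vec<u8> is what current compilers produce rather than a documented layout guarantee.

```rust
use std::mem::size_of;

/// Sizes of `&[u8]`, `Box<[u8]>` and `Vec<u8>`, measured in machine words.
/// (Hypothetical helper, not part of any library.)
fn handle_sizes_in_words() -> (usize, usize, usize) {
    let word = size_of::<usize>();
    (
        size_of::<&[u8]>() / word,     // pointer + length
        size_of::<Box<[u8]>>() / word, // pointer + length, owning
        size_of::<Vec<u8>>() / word,   // pointer + length + capacity
    )
}

fn main() {
    let (slice_ref, boxed, vec) = handle_sizes_in_words();
    println!("&[u8]: {slice_ref}, Box<[u8]>: {boxed}, Vec<u8>: {vec} (words)");

    // A Vec can be turned into a Box<[u8]>, which drops the capacity field.
    let v: Vec<u8> = vec![1, 2, 3];
    let b: Box<[u8]> = v.into_boxed_slice();
    // Borrowing the box gives a plain slice reference to the same bytes.
    let s: &[u8] = &b;
    assert_eq!(s.len(), 3);
}
```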
I am completely new to Rust (as in I just started looking at it yesterday), and am working my way through "The Rust Programming Language". I'm a little stuck on Chapters 4.2 (References and Borrowing) / 4.3 (The Slice Type) and am trying to solidify my initial understanding of references before I move on. I'm an experienced programmer whose background is mainly in C++ (I am intimately familiar with several languages, but C++ is what I'm most comfortable with).
Consider the following Rust code:
let string_obj: String = String::from("My String");
let string_ref: &String = &string_obj;
let string_slice: &str = &string_obj[1..=5];
Based on my understanding, from the first line, string_obj is an object of type String that is stored on the stack, which contains three fields: (1) a pointer to the text "My String", allocated on the heap, encoded in UTF-8; (2) A length field with value 9; (3) A capacity field with a value >= 9. That's straightforward enough.
From the second line, string_ref is an immutable reference to a String object, also stored on the stack, which contains a single field - a pointer to string_obj. This leads me to believe that (leaving aside ownership rules, semantics, and other things I am yet to learn about references), a reference is essentially a pointer to some other object. Again, pretty straightforward.
It's the third line which is causing me some headaches. From the documentation, it would appear that string_slice is an object of type &str that is stored on the stack, and contains two fields: (1) a pointer to the text "y Str", within the text "My String" associated with string_obj; (2) a length field with value 5.
But, by appearances at least, the &str type is by definition an immutable reference to an object of type str. So my questions are as follows:
What exactly is an str, and how is it represented in memory?
How does &str - a reference type, which I thought was simply a pointer - contain TWO fields (a pointer AND a length)?
How does Rust know in general what / how many fields to create when constructing a reference? (and consequently how does the programmer know?)
Slices are primitive types in Rust, which means that they don't necessarily have to follow the syntax rules of other types. In this case, str and &str are special and are treated with a bit of magic.
The type str doesn't really exist, since you can't have a slice that owns its contents. The reason for requiring us to spell this type "&str" is syntactic: the & reminds us that we're working with data borrowed from somewhere else, and it's required to be able to specify lifetimes, such as:
fn example<'a>(x: &String, y: &'a String) -> &'a str {
&y[..]
}
It's also necessary so that we can differentiate between an immutably-borrowed string slice (&str) and a mutably-borrowed string slice (&mut str). (Though the latter are somewhat limited in their usefulness and so you don't see them that often.)
Note that the same thing applies to array slices. We have arrays like [u8; 16] and we have slices like &[u8] but we don't really directly interact with [u8]. Here the mutable variant (&mut [u8]) is more useful than with string slices.
What exactly is an str, and how is it represented in memory?
As per above, str kind-of doesn't really exist by itself. The layout of &str though is as you suspect -- a pointer and a length.
(str is the actual characters referred to by the slice, and is a so-called dynamically-sized type. In the general case, a &T can't exist without a T to refer to. In this case it's a bit backwards in that the str doesn't exist without the &str slice.)
How does &str - a reference type, which I thought was simply a pointer - contain TWO fields (a pointer AND a length)?
As a primitive, it's a special case handled by the compiler.
How does Rust know in general what / how many fields to create when constructing a reference? (and consequently how does the programmer know?)
If it's a non-slice reference, then it's either a pointer or it's nothing (if the reference itself can be optimized away).
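To make the layout concrete, here is a small sketch built on the three lines from the question. It checks that &String is one word wide while &str is two, and that the slice's pointer and length are what the question guessed. The helper slice_parts is a made-up name used only for this illustration.

```rust
use std::mem::size_of;

/// Returns (byte offset of `slice` inside `owner`'s buffer, length of `slice`).
/// Hypothetical helper; assumes `slice` really was borrowed from `owner`.
fn slice_parts(owner: &str, slice: &str) -> (usize, usize) {
    let offset = slice.as_ptr() as usize - owner.as_ptr() as usize;
    (offset, slice.len())
}

fn main() {
    let string_obj: String = String::from("My String");
    let string_ref: &String = &string_obj;
    let string_slice: &str = &string_obj[1..=5];

    // A reference to a sized type is a single pointer.
    assert_eq!(size_of::<&String>(), size_of::<usize>());
    // A reference to the unsized `str` carries a pointer and a length.
    assert_eq!(size_of::<&str>(), 2 * size_of::<usize>());

    // The slice starts one byte into the String's heap buffer and is 5 bytes long.
    assert_eq!(slice_parts(string_ref, string_slice), (1, 5));
    assert_eq!(string_slice, "y Str");
}
```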
Suppose we have a vector of some type that can be cloned
let foo_vec = vec![clonable_item_1, clonable_item_2, ...];
How do I decide between .clone() and .cloned() when iterating?
foo_vec.iter().cloned()...
// vs
foo_vec.clone().iter()...
I couldn't find anything written about the difference between the two. What's the difference?
They're not at all equal. If anything, it should be v.iter().cloned() vs. v.clone().into_iter(), both produce an iterator over owned T while v.clone().iter() produces an iterator over &T.
v.clone().into_iter() clones the Vec, allocating a Vec with the same size and cloning all elements into it, then converts this newly created Vec into an iterator. v.iter().cloned(), OTOH, creates a borrowed iterator over the Vec that yields &T, then applies the cloned() iterator adapter to it, which on-the-fly clones the &T produced by Iterator::next() to produce an owned T. Thus it doesn't need to allocate a new vector.
Because of that, you should always prefer v.iter().cloned() when possible (usually it is, but Vec's IntoIter has additional capabilities, like getting the underlying slice that may be required).
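A short sketch of the two approaches, assuming a Vec<String> as the clonable data. The function names are invented for this example; the return types are the concrete standard-library iterator types.

```rust
use std::iter::Cloned;
use std::slice;
use std::vec;

/// Borrows the vector and clones each element as it is yielded.
/// No second Vec is allocated.
fn owned_items_lazy(v: &Vec<String>) -> Cloned<slice::Iter<'_, String>> {
    v.iter().cloned()
}

/// Clones the whole vector up front (one extra allocation),
/// then consumes the copy.
fn owned_items_eager(v: &Vec<String>) -> vec::IntoIter<String> {
    v.clone().into_iter()
}

fn main() {
    let foo_vec = vec![String::from("a"), String::from("b")];

    // Both iterators yield owned Strings.
    let lazy: Vec<String> = owned_items_lazy(&foo_vec).collect();
    let eager: Vec<String> = owned_items_eager(&foo_vec).collect();
    assert_eq!(lazy, eager);

    // By contrast, `foo_vec.clone().iter()` would yield `&String` items that
    // borrow from a temporary copy of the vector.
    let total: usize = foo_vec.clone().iter().map(|s: &String| s.len()).sum();
    assert_eq!(total, 2);
}
```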
I'm learning Rust and I'm trying to solve an advent of code challenge (day 9 2015).
I created a situation where I end up with a variable that has the type Vec<&&str> (note the double '&', it's not a typo). I'm now wondering if this type is different than Vec<&str>. I can't figure out if a reference to a reference to something would ever make sense. I know I can avoid this situation by using String for the from and to variables. I'm asking if Vec<&&str> == Vec<&str> and if I should try and avoid Vec<&&str>.
Here is the code that triggered this question:
use itertools::Itertools;
use std::collections::HashSet;
use std::fs;
fn main() {
let contents = fs::read_to_string("input.txt").unwrap();
let mut vertices: HashSet<&str> = HashSet::new();
for line in contents.lines() {
let data: Vec<&str> = line.split(" ").collect();
let from = data[0];
let to = data[2];
vertices.insert(from);
vertices.insert(to);
}
// `Vec<&&str>` originates from here
let permutations_iter = vertices.iter().permutations(vertices.len());
for perm in permutations_iter {
let length_trip = compute_length_of_trip(&perm);
}
}
fn compute_length_of_trip(trip: &Vec<&&str>) -> u32 {
...
}
Are Vec<&str> and Vec<&&str> different types?
I'm now wondering if this type is different than Vec<&str>.
Yes, a Vec<&&str> is a type different from Vec<&str> - you can't pass a Vec<&&str> where a Vec<&str> is expected and vice versa. Vec<&str> stores string slice references, which you can think of as pointers to data inside some strings. Vec<&&str> stores references to such string slice references, i.e. pointers to pointers to data. With the latter, accessing the string data requires an additional indirection.
However, Rust's auto-dereferencing makes it possible to use a Vec<&&str> much like you'd use a Vec<&str> - for example, v[0].len() will work just fine on either, v[some_idx].chars() will iterate over chars with either, and so on. The only difference is that Vec<&&str> stores the data more indirectly and therefore requires a bit more work on each access, which can lead to slightly less efficient code.
Note that you can always convert a Vec<&&str> to Vec<&str> - but since doing so requires allocating a new vector, if you decide you don't want Vec<&&str>, it's better not to create it in the first place.
Can I avoid Vec<&&str> and how?
Since a &str is Copy, you can avoid the creation of Vec<&&str> by adding a .copied() when you iterate over vertices, i.e. change vertices.iter() to vertices.iter().copied(). If you don't need vertices sticking around, you can also use vertices.into_iter(), which will give out &str and also free the vertices set as soon as the iteration is done.
The reason why the additional reference arises and the ways to avoid it have been covered on StackOverflow before.
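As a rough illustration that leaves out itertools, the sketch below collects the set's items with and without .copied(). The function names and the sample input line are made up for the example.

```rust
use std::collections::HashSet;

/// Iterating a `HashSet<&str>` by reference yields `&&str`.
fn borrowed_twice<'a, 's>(vertices: &'a HashSet<&'s str>) -> Vec<&'a &'s str> {
    vertices.iter().collect()
}

/// `.copied()` turns each `&&str` back into a `&str`, so the result
/// no longer borrows from the set itself.
fn borrowed_once<'s>(vertices: &HashSet<&'s str>) -> Vec<&'s str> {
    vertices.iter().copied().collect()
}

fn main() {
    // Hypothetical input line in the same shape the question's code expects.
    let contents = String::from("London to Dublin = 464");

    let names: Vec<&str> = {
        let mut vertices: HashSet<&str> = HashSet::new();
        let data: Vec<&str> = contents.split(' ').collect();
        vertices.insert(data[0]);
        vertices.insert(data[2]);

        let twice = borrowed_twice(&vertices);
        assert_eq!(twice.len(), 2);

        let once = borrowed_once(&vertices);
        once
    }; // `vertices` is dropped here; `names` only borrows from `contents`.

    assert_eq!(names.len(), 2);
    assert!(names.contains(&"London") && names.contains(&"Dublin"));
}
```

Returning the Vec<&&str> from that inner block would not compile, because it borrows from vertices and not just from contents.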
Should I avoid Vec<&&str>?
There is nothing inherently wrong with Vec<&&str> that would require one to avoid it. In most code you'll never notice the difference in efficiency between Vec<&&str> and Vec<&str>. Having said that, there are some reasons to avoid it beyond performance in microbenchmarks. The additional indirection in Vec<&&str> requires the exact &strs it was created from (and not just the strings that own the data) to stick around and outlive the new collection. This is not relevant in your case, but would become noticeable if you wanted to return the permutations to the caller that owns the strings. Also, there is value in the simpler type that doesn't accumulate a reference on each transformation. Just imagine needing to transform the Vec<&&str> further into a new vector - you wouldn't want to deal with Vec<&&&str>, and so on for every new transformation.
Regarding performance, less indirection is usually better since it avoids an extra memory access and increases data locality. However, one should also note that a Vec<&str> takes up 16 bytes per element (on 64-bit architectures) because a slice reference is represented by a "fat pointer", i.e. a pointer/length pair. A Vec<&&str> (as well as Vec<&&&str> etc.) on the other hand takes up only 8 bytes per element, because a reference to a fat reference is represented by a regular "thin" pointer. So if your vector measures millions of elements, a Vec<&&str> might be more efficient than Vec<&str> simply because it occupies less memory. As always, if in doubt, measure.
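The per-element sizes mentioned above can be checked directly. The sketch below expresses them in terms of the pointer width, so the 16 and 8 byte figures correspond to a 64-bit target; element_sizes is a made-up helper name.

```rust
use std::mem::size_of;

/// Byte sizes of `&str`, `&&str` and `&&&str` on the current target.
/// (Hypothetical helper for illustration.)
fn element_sizes() -> (usize, usize, usize) {
    (size_of::<&str>(), size_of::<&&str>(), size_of::<&&&str>())
}

fn main() {
    let word = size_of::<usize>();
    let (fat, thin, thinner) = element_sizes();

    // `&str` is a fat pointer: data pointer plus length.
    assert_eq!(fat, 2 * word);
    // A reference to a fat pointer is itself an ordinary thin pointer.
    assert_eq!(thin, word);
    assert_eq!(thinner, word);

    println!("&str: {fat} bytes, &&str: {thin} bytes, &&&str: {thinner} bytes");
}
```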
The reason you have &&str is that the data &str is owned by vertices and when you create an iterator over that data you are simply getting a reference to that data, hence the &&str.
There's really nothing to avoid here. It simply shows your iterator references the data that is inside the HashSet.
If String is actually
pub struct String {
vec: Vec<u8>,
}
Then why is there a special syntax (&str) for a slice of a Vec<u8>? In Chapter 3 of "Programming Rust" by Jim Blandy & Jason Orendorff it says,
&str is very much like &[T]: a fat pointer to some data. String is analogous to Vec<T>
Following that statement there is a chart which shows all the ways they're similar, but there isn't any mention of a single method that they're different. Is a &str just a &[T]?
Likewise in the answer to, What are the differences between Rust's String and str? it says
This is identical to the relationship between a vector Vec<T> and a slice &[T], and is similar to the relationship between by-value T and by-reference &T for general types.
That question focuses on the difference between String and &str. Knowing that a String really is a vector of u8, I'm more interested in &str, which I can't even find the source to. Why does this primitive even exist when we have a primitive (implemented as a fat pointer) for regular vector slices?
It exists for the same reason that String exists, and we don't just pass around Vec<u8> for every string.
A String is an owned, growable container of data that is guaranteed to be UTF-8.
&str is a borrowed, fixed-length container of data that is guaranteed to be UTF-8.
A Vec<u8> is an owned, growable container of u8.
&[u8] is a borrowed, fixed-length container of u8.
This is effectively the reason that types exist, period — to provide abstraction and guarantees (a.k.a. restrictions) on a looser blob of bits.
If we had access to the string as &mut [u8], then we could trivially ruin the UTF-8 guarantee, which is why all such methods are marked as unsafe. Even with an immutable &[u8], we wouldn't be able to make assumptions (a.k.a. optimizations) about the data and would have to write much more defensive code everywhere.
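A minimal sketch of that guarantee in practice: going from bytes to &str requires a validity check, while going from &str to bytes is free. The helper as_text is a made-up name; std::str::from_utf8 and str::as_bytes are the standard-library calls.

```rust
/// Views bytes as text only if they are valid UTF-8.
/// (Hypothetical helper for illustration.)
fn as_text(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

fn main() {
    let euro: &[u8] = &[0xE2, 0x82, 0xAC]; // the three-byte UTF-8 encoding of "€"
    let truncated: &[u8] = &[0xE2, 0x82]; // same sequence with the last byte cut off

    assert_eq!(as_text(euro), Some("€"));
    assert_eq!(as_text(truncated), None);

    // The other direction needs no check: every &str is valid as &[u8].
    let bytes: &[u8] = "€".as_bytes();
    assert_eq!(bytes, euro);

    // Mutable byte access (`str::as_bytes_mut`) is an `unsafe fn`, because
    // writing arbitrary bytes could break the UTF-8 invariant.
}
```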
but there isn't any mention of a single method that they're different
Looking at the documentation for str and slice quickly shows a number of methods that exist on one that don't exist on the other, so I don't understand your statement. split_last is the first one that caught my eye, for example.
&str is not necessarily a view to a String, it can be a view to anything that is a valid UTF-8 string.
For example, the crate arraystring allows creating a string on the stack that can be viewed as a &str.
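The same idea can be sketched with only the standard library, without the arraystring crate: a &str can view a byte array on the stack, a string literal in static memory, or a String's heap buffer, and a function taking &str cannot tell them apart. char_count is a made-up helper name.

```rust
/// Works on any UTF-8 text, regardless of where the bytes live.
/// (Hypothetical helper for illustration.)
fn char_count(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // Bytes stored in an array on the stack.
    let stack_buf: [u8; 5] = *b"hello";
    let from_stack: &str = std::str::from_utf8(&stack_buf).expect("ASCII is valid UTF-8");

    // Bytes stored in the program's static data.
    let from_static: &'static str = "hello";

    // Bytes stored on the heap, owned by a String.
    let owned = String::from("hello");
    let from_heap: &str = &owned;

    assert_eq!(char_count(from_stack), 5);
    assert_eq!(from_stack, from_static);
    assert_eq!(from_static, from_heap);
}
```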
&[T] is confusing me.
I naively assumed that like &T, &[T] was a pointer, which is to say, a numeric pointer address.
However, I've seen some code like this, that I was rather surprised to see work fine (simplified for demonstration purposes; but you see code like this in many 'as_slice()' implementations):
extern crate core;
extern crate collections;
use self::collections::str::raw::from_utf8;
use self::core::raw::Slice;
use std::mem::transmute;
fn main() {
let val = "Hello World";
{
let value:&str;
{
let bytes = val.as_bytes();
let mut slice = Slice { data: &bytes[0] as *const u8, len: bytes.len() };
unsafe {
let array:&[u8] = transmute(slice);
value = from_utf8(array);
}
// slice.len = 0;
}
println!("{}", value);
}
}
So.
I initially thought that this was invalid code.
That is, the instance of Slice created inside the block scope is returned to outside the block scope (by transmute), and although the code runs, the println! is actually accessing data that is no longer valid through unsafe pointers. Bad!
...but that doesn't seem to be the case.
Consider uncommenting the line // slice.len = 0;
This code still runs fine (prints 'Hello World') when this happens.
So the line...
value = from_utf8(array);
If it was an invalid pointer to the 'slice' variable, the len at the println() statement would be 0, but it is not. So effectively a copy not just of a pointer value, but a full copy of the Slice structure.
Is that right?
Does that mean that in general it's valid to return a &[T] as long as the actual inner data pointer is valid, regardless of the scope of the original &[T] that is being returned, because a &[T] assignment is a copy operation?
(This seems, to me, to be extremely counter intuitive... so perhaps I am misunderstanding; if I'm right, having two &[T] that point to the same data cannot be valid, because they won't sync lengths if you modify one...)
A slice &[T], as you have noticed, is "equivalent" to the structure std::raw::Slice. In fact, Slice is the internal representation of a &[T] value, and yes, it is a pointer and the length of the data behind that pointer. Such a structure is sometimes called a "fat pointer", that is, a pointer plus an additional piece of information.
When you pass &[T] value around, you indeed are just copying its contents - the pointer and the length.
If it was an invalid pointer to the 'slice' variable, the len at the println() statement would be 0, but it is not. So effectively a copy not just of a pointer value, but a full copy of the Slice structure.
Is that right?
So, yes, exactly.
Does that mean that in general it's valid to return a &[T] as long as the actual inner data pointer is valid, regardless of the scope of the original &[T] that is being returned, because a &[T] assignment is a copy operation?
And this is also true. That's the whole idea of borrowed references, including slices - borrowed references are statically checked to be used as long as their referent is alive. When DST finally lands, slices and regular references will be even more unified.
(This seems, to me, to be extremely counter intuitive... so perhaps I am misunderstanding; if I'm right, having two &[T] that point to the same data cannot be valid, because they won't sync lengths if you modify one...)
And this is actually an absolutely valid concern; it is one of the problems with aliasing. However, Rust is designed exactly to prevent such bugs. There are two things which render aliasing of slices valid.
First, slices can't change length; there are no methods defined on &[T] that would allow you to change its length in place. You can create a derived slice from a slice, but it will be an entirely new object.
But even if slices can't change length, if the data could be mutated through them, they still could bring disaster if aliased. For example, if values in slices are enum instances, mutating a value in such an aliased slice could make a pointer to the internals of an enum value contained in this slice invalid. So, second, Rust's aliasable slices (&[T]) are immutable. You can't change values contained in them and you can't take mutable references into them.
These two features (and the compiler's lifetime checks) make aliasing of slices absolutely safe. However, sometimes you do need to modify the data in a slice. For that you need a mutable slice, written &mut [T]. You can change your data through such a slice, but these slices are not aliasable. You can't hold two overlapping mutable slices into the same structure (an array, for example) at the same time, so you can't do anything dangerous.
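A brief sketch of both rules, with made-up function names: shared slices may overlap freely because they are read-only (and are Copy), while mutable access to two parts of one array goes through split_at_mut, which hands out non-overlapping halves.

```rust
/// Two shared slices may alias the same data; both are read-only.
fn sum_both(a: &[i32], b: &[i32]) -> i32 {
    a.iter().sum::<i32>() + b.iter().sum::<i32>()
}

/// Doubles the first half and increments the second half in place.
/// `split_at_mut` yields two disjoint `&mut [i32]`, so they never alias.
fn bump_halves(data: &mut [i32]) {
    let mid = data.len() / 2;
    let (left, right) = data.split_at_mut(mid);
    for x in left.iter_mut() {
        *x *= 2;
    }
    for x in right.iter_mut() {
        *x += 1;
    }
}

fn main() {
    let mut data = [1, 2, 3, 4];

    // Overlapping shared slices: `whole` and `tail` view the same elements.
    let whole: &[i32] = &data;
    let tail: &[i32] = &data[1..];
    let copy_of_whole = whole; // copies the (pointer, length) pair only
    assert_eq!(sum_both(copy_of_whole, tail), 19);

    // The shared borrows are no longer used, so a mutable borrow is allowed.
    bump_halves(&mut data);
    assert_eq!(data, [2, 4, 4, 5]);
}
```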
Note, however, that using transmute() to turn a Slice into a slice or vice versa is an unsafe operation. &[T] is statically guaranteed to be correct if you create it with the proper methods, like calling as_slice() on a Vec. However, creating it manually from a Slice struct and then transmuting it into &[T] is error-prone and can easily segfault your program, for example when you give it a length greater than what is actually allocated.
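The code in the question targets a pre-1.0 compiler, and core::raw::Slice is not available in current Rust. As a rough modern counterpart (a substitution, not the original author's code), the same experiment can be written with std::slice::from_raw_parts, which builds a slice from a pointer and a length. The helper name rebuild is invented for this sketch.

```rust
use std::slice;
use std::str;

/// Takes a slice apart into (pointer, length) and puts it back together.
/// (Hypothetical helper for illustration.)
fn rebuild(bytes: &[u8]) -> &[u8] {
    let ptr: *const u8 = bytes.as_ptr();
    let len: usize = bytes.len();
    // SAFETY: `ptr` and `len` come from a live slice, and the returned
    // slice has the same lifetime as the input, so the data stays valid.
    // Passing a larger `len` here would be undefined behaviour.
    unsafe { slice::from_raw_parts(ptr, len) }
}

fn main() {
    let val = "Hello World";
    let value: &str;
    {
        let bytes = val.as_bytes();
        // `array` is a new (pointer, length) pair; it does not point at `bytes`
        // the variable, only at the string data that `bytes` also points to.
        let array: &[u8] = rebuild(bytes);
        value = str::from_utf8(array).expect("bytes came from a valid &str");
    }
    // The inner variables are gone, but the string data they pointed to
    // is still alive, so `value` remains valid.
    println!("{}", value);
}
```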