I'm just asking why Rust decided to use &str for string literals instead of String. Isn't it possible for Rust to just automatically convert a string literal to a String and put it on the heap instead of putting it into the stack?
To understand the reasoning, consider that Rust wants to be a systems programming language. In general, this means that it needs to be (among other things) (a) as efficient as possible and (b) give the programmer full control over allocations and deallocations of heap memory. One use case for Rust is for embedded programming where memory is very limited.
Therefore, Rust does not want to allocate heap memory where this is not strictly necessary. String literals are known at compile time and can be written into the ro.data section of an executable/library, so they don't consume stack or heap space.
Now, given that Rust does not want to allocate the values on the heap, it is basically forced to treat string literals as &str: Strings own their values and can be moved and dropped, but how do you drop a value that is in ro.data? You can't really do that, so &str is the perfect fit.
Furthermore, treating string literals as &str (or, more accurately &'static str) has all the advantages and none of the disadvantages. They can be used in multiple places, can be shared without worrying about using heap memory and never have to be deleted. Also, they can be converted to owned Strings at will, so having them available as String is always possible, but you only pay the cost when you need to.
To create a String, you have to:
reserve a place on the heap (allocate), and
copy the desired content from a read-only location to the freshly allocated area.
If a string literal like "foo" did both, every string would effectively be allocated twice: once inside the executable as the read-only string, and the other time on the heap. You simply couldn't just refer to the original read-only data stored in the executable.
&str literals give you access to the most efficient string data: the one present in the executable image on startup, put there by the compiler along with the instructions that make up the program. The data it points to is not stored on the stack, what is stack-allocated is just the pointer/size pair, as is the case with any Rust slice.
Making "foo" desugar into what is now spelled "foo".to_owned() would make it slower and less space-efficient, and would likely require another syntax to get a non-allocating &str. After all, you don't want x == "foo" to allocate a string just to throw it away immediately. Languages like Python alleviate this by making their strings immutable, which allows them to cache strings mentioned in the source code. In Rust mutating String is often the whole point of creating it, so that strategy wouldn't work.
Related
Coming to Rust from dynamic languages like Python, I'm not used to the programming pattern where you provide a function with a mutable reference to an empty data structure and that function populates it. A typical example is reading a file into a String:
let mut f = File::open("file.txt").unwrap();
let mut contents = String::new();
f.read_to_string(&mut contents).unwrap();
To my Python-accustomed eyes, an API where you just create an owned value within the function and move it out as a return value looks much more intuitive / ergonomic / what have you:
let mut f = File::open("file.txt").unwrap();
let contents = f.read_to_string().unwrap();
Since the Rust standard library takes the former road, I figure there must be a reason for that.
Is it always preferable to use the reference pattern? If so, why? (Performance reasons? What specifically?) If not, how do I spot the cases where it might be beneficial? Is it mostly useful when I want to return another value in addition to populating the result data structure (as in the first example above, where .read_to_string() returns the number of bytes read)? Why not use a tuple? Is it simply a matter of personal preference?
If read_to_string wanted to return an owned String, this means it would have to heap allocate a new String every time it was called. Also, because Read implementations don't always know how much data there is to be read, it would probably have to incrementally re-allocate the work-in-progress String multiple times. This also means every temporary String has to go back to the allocator to be destroyed.
This is wasteful. Rust is a system programming language. System programming languages abhor waste.
Instead, the caller is responsible for allocating and providing the buffer. If you only call read_to_string once, nothing changes. If you call it more than once, however, you can re-use the same buffer multiple times without the constant allocate/resize/deallocate cycle. Although it doesn't apply in this specific case, similar interfaces can be design to also support stack buffers, meaning in some cases you can avoid heap activity entirely.
Having the caller pass the buffer in is strictly more flexible than the alternative.
I have an easy question regarding Box<X>.
I understand what it does, it allocates X on the heap.
In C++ you use the new operator to allocate something on the heap so it can outlive the current scope (because if you create something on the stack it goes away at the end of the current block).
But reading Rust's documentation, it looks like you can create something on the stack and still return it taking advantage of the language's move semantics without having to resort to the heap.
Then it's not clear to me when to use Box<X> as opposed to simply X.
I just started reading about Rust so I apologize if I'm missing something obvious.
First of all: C++11 (and newer) has move semantics with rvalue references, too. So your question would also apply to C++. Keep in mind though, that C++'s move semantics are -- unlike Rust's ones -- highly unsafe.
Second: the word "move semantic" somehow hints the absence of a "copy", which is not true. Suppose you have a struct with 100 64-bit integers. If you would transfer an object of this struct via move semantics, those 100 integers will be copied (of course, the compiler's optimizer can often remove those copies, but anyway...). The advantage of move semantics comes to play when dealing with objects that deal with some kind of data on the heap (or pointers in general).
For example, take a look at Vec (similar to C++'s vector): the type itself only contains a pointer and two pointer-sized integer (ptr, len and cap). Those three times 64bit are still copied when the vector is moved, but the main data of the vector (which lives on the heap) is not touched.
That being said, let's discuss the main question: "Why to use Box at all?". There are actually many use cases:
Unsized types: some types (e.g. Trait-objects which also includes closures) are unsized, meaning their size is not known to the compiler. But the compiler has to know the size of each stack frame -- hence those unsized types cannot live on the stack.
Recursive data structures: think of a BinaryTreeNode struct. It saves two members named "left" and "right" of type... BinaryTreeNode? That won't work. So you can box both children so that the compiler knows the size of your struct.
Huge structs: think of the 100 integer struct mentioned above. If you don't want to copy it every time, you can allocate it on the heap (this happens pretty seldom).
There are cases where you can’t return X eg. if X is ?Sized (traits, non-compile-time-sized arrays, etc.). In those cases Box<X> will still work.
More and more compilers make use of immutable strings (because of string interning, but are there other reasons?). However, string buffers are much faster when concatenating strings. Is there any reason why not all compilers use string buffers internally instead of immutable strings?
Probably the biggest argument for immutability is its benefits for concurrency. There's no need to lock and protect an object if you know it will never change. As the cores in our multi-core processors multiply, this benefit becomes more and more compelling.
There are trade-offs, of course. As you mention, string buffers can out-perform the constant allocation of new strings in apps that do a lot of string manipulation. Luckily, most languages include a string buffer tucked away in a library. By default, immutable strings are safer. In some cases, they're faster. If you find they're not working for you, you can always swap in a buffer.
Some times you could want to avoid/minimize the garbage collector, so I want to be sure about how to do it.
I think that the next one is correct:
Declare variables at the beginning of the function.
To use array instead of slice.
Any more?
To minimize garbage collection in Go, you must minimize heap allocations. To minimize heap allocations, you must understand when allocations happen.
The following things always cause allocations (at least in the gc compiler as of Go 1):
Using the new built-in function
Using the make built-in function (except in a few unlikely corner cases)
Composite literals when the value type is a slice, map, or a struct with the & operator
Putting a value larger than a machine word into an interface. (For example, strings, slices, and some structs are larger than a machine word.)
Converting between string, []byte, and []rune
As of Go 1.3, the compiler special cases this expression to not allocate: m[string(b)], where m is a map and b is a []byte
Converting a non-constant integer value to a string
defer statements
go statements
Function literals that capture local variables
The following things can cause allocations, depending on the details:
Taking the address of a variable. Note that addresses can be taken implicitly. For example a.b() might take the address of a if a isn't a pointer and the b method has a pointer receiver type.
Using the append built-in function
Calling a variadic function or method
Slicing an array
Adding an element to a map
The list is intended to be complete and I'm reasonably confident in it, but am happy to consider additions or corrections.
If you're uncertain of where your allocations are happening, you can always profile as others suggested or look at the assembly produced by the compiler.
Avoiding garbage is relatively straight forward. You need to understand where the allocations are being made and see if you can avoid the allocation.
First, declaring variables at the beginning of a function will NOT help. The compiler does not know the difference. However, human's will know the difference and it will annoy them.
Use of an array instead of a slice will work, but that is because arrays (unless dereferenced) are put on the stack. Arrays have other issues such as the fact that they are passed by value (copied) between functions. Anything on the stack is "not garbage" since it will be freed when the function returns. Any pointer or slice that may escape the function is put on the heap which the garbage collector must deal with at some point.
The best thing you can do is avoid allocation. When you are done with large bits of data which you don't need, reuse them. This is the method used in the profiling tutorial on the Go blog. I suggest reading it.
Another example besides the one in the profiling tutorial: Lets say you have an slice of type []int named xs. You continually append to the []int until you reach a condition and then you reset it so you can start over. If you do xs = nil, you are now declaring the underlying array of the slice as garbage to be collected. Append will then reallocate xs the next time you use it. If instead you do xs = xs[:0], you are still resetting it but keeping the old array.
For the most part, trying to avoid creating garbage is premature optimization. For most of your code it does not matter. But you may find every once in a while a function which is called a great many times that allocates a lot each time it is run. Or a loop where you reallocate instead of reusing. I would wait until you see the bottle neck before going overboard.
Once I studied about the advantage of a string being immutable because of something to improve performace in memory.
Can anybody explain this to me? I can't find it on the Internet.
Immutability (for strings or other types) can have numerous advantages:
It makes it easier to reason about the code, since you can make assumptions about variables and arguments that you can't otherwise make.
It simplifies multithreaded programming since reading from a type that cannot change is always safe to do concurrently.
It allows for a reduction of memory usage by allowing identical values to be combined together and referenced from multiple locations. Both Java and C# perform string interning to reduce the memory cost of literal strings embedded in code.
It simplifies the design and implementation of certain algorithms (such as those employing backtracking or value-space partitioning) because previously computed state can be reused later.
Immutability is a foundational principle in many functional programming languages - it allows code to be viewed as a series of transformations from one representation to another, rather than a sequence of mutations.
Immutable strings also help avoid the temptation of using strings as buffers. Many defects in C/C++ programs relate to buffer overrun problems resulting from using naked character arrays to compose or modify string values. Treating strings as a mutable types encourages using types better suited for buffer manipulation (see StringBuilder in .NET or Java).
Consider the alternative. Java has no const qualifier. If String objects were mutable, then any method to which you pass a reference to a string could have the side-effect of modifying the string. Immutable strings eliminate the need for defensive copies, and reduce the risk of program error.
Immutable strings are cheap to copy, because you don't need to copy all the data - just copy a reference or pointer to the data.
Immutable classes of any kind are easier to work with in multiple threads, the only synchronization needed is for destruction.
Perhaps, my answer is outdated, but probably someone will found here a new information.
Why Java String is immutable and why it is good:
you can share a string between threads and be sure no one of them will change the string and confuse another thread
you don’t need a lock. Several threads can work with immutable string without conflicts
if you just received a string, you can be sure no one will change its value after that
you can have many string duplicates – they will be pointed to a single instance, to just one copy. This saves computer memory (RAM)
you can do substring without copying, – by creating a pointer to an existing string’s element. This is why Java substring operation implementation is so fast
immutable strings (objects) are much better suited to use them as key in hash-tables
a) Imagine StringPool facility without making string immutable , its not possible at all because in case of string pool one string object/literal e.g. "Test" has referenced by many reference variables , so if any one of them change the value others will be automatically gets affected i.e. lets say
String A = "Test" and String B = "Test"
Now String B called "Test".toUpperCase() which change the same object into "TEST" , so A will also be "TEST" which is not desirable.
b) Another reason of Why String is immutable in Java is to allow String to cache its hashcode , being immutable String in Java caches its hash code and do not calculate every time we call hashcode method of String, which makes it very fast as hashmap key.
Think of various strings sitting on a common pool. String variables then point to locations in the pool. If u copy a string variable, both the original and the copy shares the same characters. These efficiency of sharing outweighs the inefficiency of string editing by extracting substrings and concatenating.
Fundamentally, if one object or method wishes to pass information to another, there are a few ways it can do it:
It may give a reference to a mutable object which contains the information, and which the recipient promises never to modify.
It may give a reference to an object which contains the data, but whose content it doesn't care about.
It may store the information into a mutable object the intended data recipient knows about (generally one supplied by that data recipient).
It may return a reference to an immutable object containing the information.
Of these methods, #4 is by far the easiest. In many cases, mutable objects are easier to work with than immutable ones, but there's no easy way to share with "untrusted" code the information that's in a mutable object without having to first copy the information to something else. By contrast, information held in an immutable object to which one holds a reference may easily be shared by simply sharing a copy of that reference.