Initializing a Vec in Rust is incredibly slow compared with other languages. For example, the following code
let xs: Vec<u32> = vec![0u32; 1000000];
will translate to
let mut xs: Vec<u32> = Vec::new();
xs.push(0);
xs.push(0);
xs.push(0);
// ...
one million times. If you compare this to the following code in C:
uint32_t* xs = calloc(1000000, 0);
the difference is striking.
I had a little bit more luck with
let mut xs: Vec<u32> = Vec::with_capacity(1000000);
xs.resize(1000000, 0);
but it's still very slow.
Is there any way to initialize a Vec faster?
You are actually performing different operations. In Rust, you are allocating an array of one million zeroes. In C, you are allocating an array of one million zero-length elements (i.e. non-existent elements); I don't believe this actually does anything, as DK pointed out in a comment on your question.
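For a like-for-like comparison, the C side would need to allocate one million actual uint32_t elements; here is a sketch of the presumably intended call (my correction, not part of the original answer):

#include <stdint.h>
#include <stdlib.h>

int main(void) {
    /* One million zero-initialized uint32_t values, matching vec![0u32; 1000000]. */
    uint32_t *xs = calloc(1000000, sizeof(uint32_t));
    if (xs == NULL)
        return 1;
    free(xs);
    return 0;
}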
Also, running the code you presented verbatim gave me very comparable times on my laptop when optimizing. However, this is probably because the vector allocation in Rust is optimized away, as the variable is never used.
cargo build --release
time ../target/release/test
real 0.024s
user 0.004s
sys 0.008s
and the C:
gcc -O3 test.c
time ./a.out
real 0.023s
user 0.004s
sys 0.004s
Without --release, the Rust performance drops, presumably because the allocation actually happens. Note also that calloc() can often avoid zeroing memory itself: freshly mapped pages from the operating system are already zeroed, so it only needs to clear memory that is being reused. This makes the execution time of calloc() somewhat dependent on the previous state of your memory.
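For reference, here is a sketch of the usual fast initializations on the Rust side (std::hint::black_box is my addition, used only to keep the optimizer from discarding the unused vectors while timing):

use std::hint::black_box;

fn main() {
    // vec![0; n] is specialized for zeroable element types: it requests
    // already-zeroed memory from the allocator (much like calloc) instead
    // of pushing elements in a loop.
    let xs: Vec<u32> = vec![0u32; 1_000_000];

    // with_capacity + resize performs a single allocation, then writes the zeroes.
    let mut ys: Vec<u32> = Vec::with_capacity(1_000_000);
    ys.resize(1_000_000, 0);

    // Keep both vectors observably used so the allocations aren't elided.
    black_box((xs.len(), ys.len()));
}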
Multiple structures in Rust, such as Vec and String, have shrink_to or shrink_to_fit methods. But apparently there's nothing like shrink_from_to.
Why would I want that?
Assume I have an XY-gigabyte string or vector in memory and know the exact start and end positions of the part I am interested in (which occupies only Z GB from start to end, somewhere in the middle). I could call truncate and then shrink_to_fit, effectively freeing the memory beyond end.
However, I still have gigabytes of memory occupied by [0..start] which are of no relevance for my further processing.
Question
Is there any way to free this memory too without reallocating and copying the relevant parts there?
Note that shrink_to_fit does reallocate by copying into a smaller buffer and freeing the old buffer. Your best bet is probably just converting the slice you care about into an owned Vec and then dropping the original Vec.
fn main() {
    let v1 = (0..1000).collect::<Vec<_>>(); // needs to be dropped
    println!("{}", v1.capacity()); // 1000
    let v2 = v1[100..150].to_owned(); // don't drop this!
    println!("{}", v2.capacity()); // 50
    drop(v1);
}
Is there any way to free this memory too without reallocating and copying the relevant parts there?
Move the segment you want to keep to the start of the collection (e.g. replace_range, drain, copy_within, rotate, ...), then truncate, then shrink.
APIs like realloc and mremap work in terms of "memory blocks" (i.e. whole allocations returned by malloc/mmap); they don't work in terms of arbitrary interior pointers. So a hypothetical shrink_from_to would just be doing the above under the covers, since you can't really resize an allocation from both ends.
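A minimal sketch of that move-then-shrink recipe for a Vec, with illustrative start and end positions:

fn main() {
    let mut v: Vec<u32> = (0..1_000).collect();
    let (start, end) = (100, 150); // the segment we want to keep

    v.drain(..start);        // shift the segment to the front of the buffer
    v.truncate(end - start); // cut off everything after it
    v.shrink_to_fit();       // return the excess capacity (may copy the small remainder)

    assert_eq!(v.len(), 50);
    assert_eq!(v[0], 100);
}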
I'm new to Rust, and I am wondering what exactly happens when moving a variable.
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let p = Point { x: 1, y: 1 };
    let q = p;
}
When let q = p; runs, will the data (whose size is 8 bytes) be copied from one memory address to another? Since p is moved here and thus cannot be used anymore, I think it would be fine for q's underlying memory address to equal p's. In other words, I think it is OK if nothing is copied in the machine code.
So my question is: will data be copied byte by byte when moving a variable? If it will, why?
[W]ill data be copied byte by byte when moving a variable?
In general, yes. To move a value, Rust simply performs a bitwise copy. If the value is not Copy, the source won't be used anymore after the move. If the value is Copy, both the source and the destination can be used.
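To make the Copy distinction concrete, here is a small sketch of my own (not from the question):

#[derive(Clone, Copy)]
struct Point {
    x: i32,
    y: i32,
}

fn main() {
    let p = Point { x: 1, y: 1 };
    let q = p; // bitwise copy; p stays usable because Point is Copy
    println!("{} {}", p.x, q.y);

    let s = String::from("hi");
    let t = s; // also a bitwise copy (of ptr/len/capacity), but s is moved out
    // println!("{s}"); // error[E0382]: borrow of moved value: `s`
    println!("{t}");
}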
However, there are many cases where the compiler backend can eliminate the copy by proving that the code behaves identically without it. This optimization happens completely in LLVM. In your example, the LLVM IR still contains the instructions to move the data, but the generated machine code does not contain the move even in debug mode.
If it will, why?
There are many reasons why the compiler can be unable to use the same memory for source and destination. In your example, with two variables in the same stack frame, it's easy to see that the move is not needed, but the code is a bit pointless anyway (though sometimes people do move values inside a function to make a variable immutable).
Here are just a few illustrations why the compiler may be unable to reuse the source memory for the destination:
The source value may be on the stack, while the destination is on the heap, or vice versa. The statement let b = Box::new(3); will move the value 3 from the stack to the heap; let i = *b; will move it from the heap back to the stack (see the sketch below). It's still possible that the compiler can eliminate these moves, e.g. by writing the constant 3 to the heap immediately, without writing it to the stack first.
Source and destination may be on different stack frames, when moving values across functions – e.g. when passing a value into a function, or when returning a value from a function.
Source and destination values may be stored in struct fields, so they need to have the right offset inside the struct.
These are just a few examples. The takeaway is that in general, a move may result in a bitwise copy. Keep in mind that a bitwise copy is very cheap, though, and that the optimizer usually does a good job, so you should only worry about this if you actually have a proven performance bottleneck.
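Finally, a sketch of the Box round trip from the first bullet:

fn main() {
    let b = Box::new(3); // the value 3 is moved into a heap allocation
    let i = *b;          // and travels from the heap back into a stack slot
    println!("{i}");
}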
In C++ when joining a bunch of strings (where each element's size is known roughly), it's common to pre-allocate memory to avoid multiple re-allocations and moves:
std::vector<std::string> words;
constexpr size_t APPROX_SIZE = 20;
std::string phrase;
phrase.reserve((words.size() + 5) * APPROX_SIZE); // <-- avoid multiple allocations
for (const auto &w : words)
    phrase.append(w);
Similarly, I did this in Rust (this chunk needs the unicode-segmentation crate)
use unicode_segmentation::UnicodeSegmentation;

fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    for gc in input.graphemes(true /*extended*/).rev() {
        result.push_str(gc)
    }
    result
}
I was told that the idiomatic way of doing it is a single expression
fn reverse(input: &str) -> String {
    input
        .graphemes(true /*extended*/)
        .rev()
        .collect::<Vec<&str>>()
        .concat()
}
While I really like it and want to use it, from a memory allocation point of view, would the former allocate fewer chunks than the latter?
I disassembled this with cargo rustc --release -- --emit asm -C "llvm-args=-x86-asm-syntax=intel" but it doesn't have source code interspersed, so I'm at a loss.
Your original code is fine and I do not recommend changing it.
The original version allocates once: inside String::with_capacity.
The second version allocates at least twice: first, it creates a Vec<&str> and grows it by pushing &strs onto it. Then, it counts the total size of all the &strs and creates a new String with the correct size. (The code for this is in the join_generic_copy method in str.rs.) This is bad for several reasons:
It allocates unnecessarily, obviously.
Grapheme clusters can be arbitrarily large, so the intermediate Vec can't be usefully sized in advance -- it just starts at size 1 and grows from there.
For typical strings, it allocates way more space than would actually be needed just to store the end result, because &str is usually 16 bytes in size while a UTF-8 grapheme cluster is typically much less than that.
It wastes time iterating over the intermediate Vec to get the final size where you could just take it from the original &str.
On top of all this, I wouldn't even consider this version idiomatic, because it collects into a temporary Vec in order to iterate over it, instead of just collecting the original iterator, as you had in an earlier version of your question. This version fixes problem #3 and makes #4 irrelevant, but doesn't satisfactorily address #2:
input.graphemes(true).rev().collect()
collect uses FromIterator for String, which will try to use the lower bound of the size_hint from the Iterator implementation for Graphemes. However, as I mentioned earlier, extended grapheme clusters can be arbitrarily long, so the lower bound can't be any greater than 1. Worse, &strs may be empty, so FromIterator<&str> for String doesn't know anything about the size of the result in bytes. This code just creates an empty String and calls push_str on it repeatedly.
Which, to be clear, is not bad! String has a growth strategy that guarantees amortized O(1) insertion, so if you have mostly tiny strings that won't need to be reallocated often, or you don't believe the cost of allocation is a bottleneck, using collect::<String>() here may be justified if you find it more readable and easier to reason about.
Let's go back to your original code.
let mut result = String::with_capacity(input.len());
for gc in input.graphemes(true).rev() {
    result.push_str(gc);
}
This is idiomatic. collect is also idiomatic, but all collect does is basically the above, with a less accurate initial capacity. Since collect doesn't do what you want, it's not unidiomatic to write the code yourself.
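For intuition, collecting &strs into a String boils down to roughly the following (a simplification for illustration, not the actual standard-library source):

// Roughly what FromIterator<&str> for String does, per the description above.
fn collect_into_string<'a, I>(iter: I) -> String
where
    I: Iterator<Item = &'a str>,
{
    let mut buf = String::new(); // no useful byte-size hint is available
    for s in iter {
        buf.push_str(s); // grows with the amortized-O(1) strategy
    }
    buf
}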
There is a slightly more concise, iterator-y version that still makes only one allocation. Use the extend method, which is part of Extend<&str> for String:
fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    result.extend(input.graphemes(true).rev());
    result
}
I have a vague feeling that extend is nicer, but both of these are perfectly idiomatic ways of writing the same code. You should not rewrite it to use collect, unless you feel that expresses the intent better and you don't care about the extra allocation.
I often see a high number of cycles spent in GC when running GHC-compiled programs.
These numbers tend to be an order of magnitude higher than my JVM experience suggests they should be. In particular, the number of bytes "copied" by the GC seems vastly larger than the amount of data I'm computing.
Is such a difference between non-strict and strict languages fundamental?
tl;dr: Most of the stuff that the JVM does in stack frames, GHC does on the heap. If you wanted to compare GHC heap/GC stats with the JVM equivalent, you'd really need to account for some portion of the bytes/cycles the JVM spends pushing arguments on the stack or copying return values between stack frames.
Long version:
Languages targeting the JVM typically make use of its call stack. Each invoked method has an active stack frame that includes storage for the parameters passed to it, additional local variables, and temporary results, plus room for an "operand stack" used for passing arguments to and receiving results from other methods it calls.
As a simple example, if the Haskell code:
bar :: Int -> Int -> Int
bar a b = a * b
foo :: Int -> Int -> Int -> Int
foo x y z = let u = bar y z in x + u
were compiled to JVM, the byte code would probably look something like:
public static int bar(int, int);
Code:
stack=2, locals=2, args_size=2
0: iload_0 // push a
1: iload_1 // push b
2: imul // multiply and push result
3: ireturn // pop result and return it
public static int foo(int, int, int);
Code:
stack=2, locals=4, args_size=3
0: iload_1 // push y
1: iload_2 // push z
2: invokestatic bar // call bar, pushing result
5: istore_3 // pop and save to "u"
6: iload_0 // push x
7: iload_3 // push u
8: iadd // add and push result
9: ireturn // pop result and return it
Note that calls to built-in primitives like imul and user-defined methods like bar involve copying/pushing the parameter values from local storage to the operand stack (using iload instructions) and then invoking the primitive or method. Return values then need to be saved/popped to local storage (with istore) or returned to the caller with ireturn; occasionally, a return value can be left on the stack to serve as an operand for another method invocation. Also, while it's not explicit in the byte code, the ireturn instruction involves a copy, from the callee's operand stack to the caller's operand stack. Of course, in actual JVM implementations, various optimizations are presumably possible to reduce copying.
When something else eventually calls foo to produce a computation, for example:
some_caller t = foo (1+3) (2+4) t + 1
the (unoptimized) code might look like:
iconst_1
iconst_3
iadd // put 1+3 on the stack
iconst_2
iconst_4
iadd // put 2+4 on the stack
iload_0 // put t on the stack
invokestatic foo
iconst_1
iadd
ireturn
Again, subexpressions are evaluated with a lot of pushing and popping on the operand stack. Eventually, foo is invoked with its arguments pushed on the stack and its result popped off for further processing.
All of this allocation and copying takes place on this stack, so there's no heap allocation involved in this example.
Now, what happens if that same code is compiled with GHC 8.6.4 (without optimization and on an x86_64 architecture for the sake of concreteness)? Well, the pseudocode for the generated assembly is something like:
foo [x, y, z] =
    u = new THUNK(sat_u)    // thunk, 32 bytes on heap
    jump: (+) x u

sat_u [] =                  // saturated closure for "bar y z"
    push UPDATE(sat_u)      // update frame, 16 bytes on stack
    jump: bar y z

bar [a, b] =
    jump: (*) a b
The calls/jumps to the (+) and (*) "primitives" are actually more complicated than I've made them out to be because of the typeclass that's involved. For example, the jump to (+) looks more like:
push CONTINUATION(\f -> f x u) // continuation, 24 bytes on stack
jump: (+) dNumInt // get the right (+) from typeclass instance
If you turn on -O2, GHC optimizes away this more complicated call, but it also optimizes away everything else that's interesting about this example, so for the sake of argument, let's pretend the pseudocode above is accurate.
Again, foo isn't of much use until someone calls it. For the some_caller example above, the portion of code that calls foo will look something like:
some_caller [t] =
    ...
    foocall = new THUNK(sat_foocall)  // thunk, 24 bytes on heap
    ...

sat_foocall [] =                      // saturated closure for "foo (1+3) (2+4) t"
    ...
    v = new THUNK(sat_v)              // thunk "1+3", 16 bytes on heap
    w = new THUNK(sat_w)              // thunk "2+4", 16 bytes on heap
    push UPDATE(sat_foocall)          // update frame, 16 bytes on stack
    jump: foo sat_v sat_w t

sat_v [] = ...
sat_w [] = ...
Note that nearly all of this allocation and copying takes place on the heap, rather than the stack.
Now, let's compare these two approaches. At first blush, it looks like the culprit really is lazy evaluation. We're creating these thunks all over the place that wouldn't be necessary if evaluation was strict, right? But let's look at one of these thunks more carefully. Consider the thunk for sat_u in the definition of foo. It's 32 bytes / 4 words with the following contents:
// THUNK(sat_u)
word 0: ptr to sat_u info table/code
     1: space for return value
     // variables we closed over:
     2: ptr to "y"
     3: ptr to "z"
The creation of this thunk isn't fundamentally different than the JVM code:
0: iload_1 // push y
1: iload_2 // push z
2: invokestatic bar // call bar, pushing result
5: istore_3 // pop and save to "u"
Instead of pushing y and z onto the operand stack, we loaded them into a heap-allocated thunk. Instead of popping the result off the operand stack into our stack frame's local storage and managing stack frames and return addresses, we left space for the result in the thunk and pushed a 16-byte update frame onto the stack before transferring control to bar.
Similarly, in the call to foo in some_caller, instead of evaluating the argument subexpressions by pushing constants on the stack and invoking primitives to push results on the stack, we created thunks on the heap, each of which included a pointer to info table / code for invoking primitives on those arguments and space for the return value; an update frame replaced the stack bookkeeping and result copying implicit in the JVM version.
Ultimately, thunks and update frames are GHC's replacement for stack-based parameter and result passing, local variables, and temporary workspace. A lot of activity that takes place in JVM stack frames takes place in the GHC heap.
Now, most of the stuff in JVM stack frames and on the GHC heap quickly becomes garbage. The main difference is that in the JVM, stack frames are automatically tossed out when a function returns, after the runtime has copied the important stuff out (e.g., return values). In GHC, the heap needs to be garbage collected. As others have noted, the GHC runtime is built around the idea that the vast majority of heap objects will immediately become garbage: a fast bump allocator is used for initial heap object allocation, and instead of copying out the important stuff every time a function returns (as for the JVM), the garbage collector copies it out when the bump heap gets kind of full.
Obviously, the above toy example is ridiculous. In particular, things are going to get much more complicated when we start talking about code that operates on Java objects and Haskell ADTs, rather than Ints. However, it serves to illustrate the point that a direct comparison of heap usage and GC cycles between GHC and JVM doesn't make a whole lot of sense. Certainly, an exact accounting doesn't really seem possible as the JVM and GHC approaches are too fundamentally different, and the proof would be in real-world performance. At the very least, an apples-to-apples comparison of GHC heap usage and GC stats needs to account for some portion of the cycles the JVM spends pushing, popping, and copying values between operand stacks. In particular, at least some fraction of JVM return instructions should count towards GHC's "bytes copied".
As for the contribution of "laziness" to heap usage (and heap "garbage" in particular), it seems hard to isolate. Thunks really play a dual role as a replacement for stack-based operand passing and as a mechanism for deferred evaluation. Certainly a switch from laziness to strictness can reduce garbage -- instead of first creating a thunk and then eventually evaluating it to another closure (e.g., a constructor), you can just create the evaluated closure directly -- but that just means that instead of your simple program allocating a mind-blowing 172 gigabytes on the heap, maybe the strict version "only" allocates a modest 84 gigabytes.
As far as I can see, the specific contribution of lazy evaluation to "bytes copied" should be minimal -- if a closure is important at GC time, it will need to be copied. If it's still an unevaluated thunk, the thunk will be copied. If it's been evaluated, just the final closure will need to be copied. If anything, since thunks for complicated structures are much smaller than their evaluated versions, laziness should typically reduce bytes copied. Instead, the usual big win with strictness is that it allows certain heap objects (or stack objects) to become garbage faster so we don't end up with space leaks.
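As a standard illustration of that last point (my example, not from the original answer): a lazy left fold keeps a growing chain of addition thunks live, a classic space leak, while the strict variant evaluates as it goes.

import Data.List (foldl')

-- Lazy: builds the thunk chain (((0+1)+2)+...) on the heap; every GC that
-- runs before the result is forced sees the whole chain as live and copies it.
lazySum :: [Int] -> Int
lazySum = foldl (+) 0

-- Strict: forces the accumulator at each step, so no thunk chain ever exists.
strictSum :: [Int] -> Int
strictSum = foldl' (+) 0

main :: IO ()
main = print (strictSum [1 .. 1000000])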
No, laziness does not inherently lead to a large amount of copying in GC. The programmer's failure to manage laziness properly, however, can certainly do so. For example, if a persistent data structure ends up full of chains of thunks due to lazy modification, then it will end up badly bloated.
Another major issue you may be encountering, as Daniel Wagner mentioned, is the cost of immutability. While it is certainly possible to program with mutable structures in Haskell, it is much more idiomatic to work with immutable ones when possible. Immutable structure designs have various trade-offs. For example, ones designed for high performance when used persistently tend to have low branching factors to increase sharing, which leads to some bloat when they're used ephemerally.
I am guessing (hoping) the answer is never, and that such memory must be explicitly freed.
For example, if I wrote:
julia> x = Libc.malloc(1_000_000)
Ptr{Void} @0x0000000002f6bd80

julia> x = nothing
have I just leaked ~1MB of memory?
However, I am not 100% certain this is true, because the docs don't mention it at all.
help?> Libc.malloc(3)
malloc(size::Integer) -> Ptr{Void}
Call malloc from the C standard library.
Yes, you are correct.
Julia is designed to seamlessly interoperate with C on a low level, so when you use the C wrapper libraries, you get C semantics and no garbage collection.
The docs for Libc.malloc are not written to teach C, but could be improved to mention Libc.free, in case anyone gets confused.
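A minimal sketch of the manual pattern, pairing the allocation with an explicit Libc.free (my example, not from the answer):

x = Libc.malloc(1_000_000)  # raw allocation; the GC does not track it
try
    p = convert(Ptr{UInt8}, x)
    unsafe_store!(p, 0x2a)  # write one byte through the pointer
    @show unsafe_load(p)    # read it back
finally
    Libc.free(x)            # must be freed explicitly
end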
Yet one more answer
Yes, you leaked ~1 MB of memory. But there's a mechanism that implements ownership transfer:
struct MyStruct
    ...
end

n = 10
x = Base.Libc.malloc(n * sizeof(MyStruct))          # returns Ptr{Nothing}
xtyped = convert(Ptr{MyStruct}, x)                  # something like a reinterpret cast
vector = unsafe_wrap(Array, xtyped, n; own = true)  # returns Vector{MyStruct}
N.B. The last line transfers ownership of the memory to Julia; from that moment on it's better to avoid using x and xtyped, as they may point to already-freed memory.
Such low-level kung fu can prove helpful when dealing with binary files, especially together with the function unsafe_read.
Alternatively, as mentioned above, you can use Base.Libc.free(x) to free the memory manually.
P.S. However, it is often better to rely on the built-in memory management. By default, Julia tries to allocate immutable structs on the stack, which improves performance.