Does String::from_utf8_lossy() allocate memory?


I want to convert the first 10 bytes of an array to a string.
If I do String::from_utf8_lossy(), this will return &str.
Do I understand correctly that the &str simply points at those 10 bytes, so that memory is only needed for the reference itself?

Quoting from the docs for String::from_utf8_lossy:
This function returns a Cow<'a, str>. If our byte slice is invalid UTF-8, then we need to insert the replacement characters, which will change the size of the string, and hence, require a String. But if it's already valid UTF-8, we don't need a new allocation. This return type allows us to handle both cases.
So it doesn't return a &str, but rather Cow<str>, and only allocates if necessary to replace invalid bytes with "�".
In general, though, if a function actually returns &str, that &str won't be (newly) allocated. It'll either be static (embedded in the binary itself) or will have a lifetime derived from some argument to the function (e.g. String::trim).
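A minimal sketch of the two cases, using only the standard library. The helper names lossy_is_borrowed and first_n_lossy are made up for illustration; the second one covers the "first 10 bytes" case from the question.

```rust
use std::borrow::Cow;

/// Returns true when `from_utf8_lossy` could borrow the input bytes
/// instead of building a new String.
fn lossy_is_borrowed(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Borrowed(_))
}

/// Converts at most the first `n` bytes of `bytes` (hypothetical helper).
/// The result borrows from `bytes` when those bytes are valid UTF-8.
fn first_n_lossy(bytes: &[u8], n: usize) -> Cow<'_, str> {
    let end = n.min(bytes.len());
    String::from_utf8_lossy(&bytes[..end])
}
```

For valid input such as b"hello", lossy_is_borrowed returns true, so nothing was allocated. For input containing an invalid byte such as 0xFF it returns false, because a new String holding the replacement character had to be built.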

Related

Does std::ptr::write transfer the "uninitialized-ness" of the bytes it writes?

I'm working on a library that helps pass types that fit in a pointer-sized integer across FFI boundaries. Suppose I have a struct like this:
use std::mem::{size_of, align_of};
struct PaddingDemo {
    data: u8,
    force_pad: [usize; 0]
}
assert_eq!(size_of::<PaddingDemo>(), size_of::<usize>());
assert_eq!(align_of::<PaddingDemo>(), align_of::<usize>());
This struct has 1 data byte and 7 padding bytes (on a 64-bit target). I want to pack an instance of this struct into a usize and then unpack it on the other side of an FFI boundary. Because this library is generic, I'm using MaybeUninit and ptr::write:
use std::ptr;
use std::mem::MaybeUninit;
let data = PaddingDemo { data: 12, force_pad: [] };
// In order to ensure all the bytes are initialized,
// zero-initialize the buffer
let mut packed: MaybeUninit<usize> = MaybeUninit::zeroed();
let ptr = packed.as_mut_ptr() as *mut PaddingDemo;
let packed_int = unsafe {
    std::ptr::write(ptr, data);
    packed.assume_init()
};
// Attempt to trigger UB in Miri by reading the
// possibly uninitialized bytes
let copied = unsafe { ptr::read(&packed_int) };
Does that assume_init call trigger undefined behavior? In other words, when ptr::write copies the struct into the buffer, does it copy the uninitialized-ness of the padding bytes, overwriting the zero bytes that were previously initialized?
Currently, when this or similar code is run in Miri, it doesn't detect any Undefined Behavior. However, per the discussion about this issue on github, ptr::write is supposedly allowed to copy those padding bytes, and furthermore to copy their uninitialized-ness. Is that true? The docs for ptr::write don't talk about this at all, nor does the nomicon section on uninitialized memory.

Does that assume_init call trigger undefined behavior?
Yes. "Uninitialized" is just another value that a byte in the Rust Abstract Machine can have, next to the usual 0x00 - 0xFF. Let us write this special byte as 0xUU. (See this blog post for a bit more background on this subject.) 0xUU is preserved by copies just like any other possible value a byte can have is preserved by copies.
But the details are a bit more complicated.
There are two ways to copy data around in memory in Rust.
Unfortunately, the details for this are also not explicitly specified by the Rust language team, so what follows is my personal interpretation. I think what I am saying is uncontroversial unless marked otherwise, but of course that could be a wrong impression.
Untyped / byte-wise copy
In general, when a range of bytes is being copied, the source range just overwrites the target range -- so if the source range was "0x00 0xUU 0xUU 0xUU", then after the copy the target range will have that exact list of bytes.
This is what memcpy/memmove in C behave like (in my interpretation of the standard, which is not very clear here unfortunately). In Rust, ptr::copy{,_nonoverlapping} probably performs a byte-wise copy, but it's not actually precisely specified right now and some people might want to say it is typed as well. This was discussed a bit in this issue.
Typed copy
The alternative is a "typed copy", which is what happens on every normal assignment (=) and when passing values to/from a function. A typed copy interprets the source memory at some type T, and then "re-serializes" that value of type T into the target memory.
The key difference to a byte-wise copy is that information which is not relevant at the type T is lost. This is basically a complicated way of saying that a typed copy "forgets" padding, and effectively resets it to uninitialized. Compared to an untyped copy, a typed copy loses more information. Untyped copies preserve the underlying representation, typed copies just preserve the represented value.
So even when you transmute 0usize to PaddingDemo, a typed copy of that value can reset this to "0x00 0xUU 0xUU 0xUU" (or any other possible bytes for the padding) -- assuming data sits at offset 0, which is not guaranteed (add #[repr(C)] if you want that guarantee).
In your case, ptr::write takes an argument of type PaddingDemo, and the argument is passed via a typed copy. So already at that point, the padding bytes may change arbitrarily, in particular they may become 0xUU.
Uninitialized usize
Whether your code has UB then depends on yet another factor, namely whether having an uninitialized byte in a usize is UB. The question is, does a (partially) uninitialized range of memory represent some integer? Currently, it does not and thus there is UB. However, whether that should be the case is heavily debated and it seems likely that we will eventually permit it.
Many other details are still unclear, though -- for example, transmuting "0x00 0xUU 0xUU 0xUU" to an integer may well result in a fully uninitialized integer, i.e., integers may not be able to preserve "partial initialization". To preserve partially initialized bytes in integers we would have to basically say that an integer has no abstract "value", it is just a sequence of (possibly uninitialized) bytes. This does not reflect how integers get used in operations like /. (Some of this also depends on LLVM decisions around poison and freeze; LLVM might decide that when doing a load at integer type, the result is fully poison if any input byte is poison.) So even if the code is not UB because we permit uninitialized integers, it may not behave as expected because the data you want to transfer is being lost.
If you want to transfer raw bytes around, I suggest using a type suited for that, such as MaybeUninit. If you use an integer type, the goal should be to transfer integer values -- i.e., numbers.
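As a rough sketch of that suggestion, the value can stay inside a MaybeUninit<usize> for the whole round trip, so that assume_init is never called on bytes that may be padding. The function names pack and unpack are hypothetical, #[repr(C)] is added to pin the field layout, and the code assumes the size and alignment of the struct match those of usize (which it checks at run time).

```rust
use std::mem::{align_of, size_of, MaybeUninit};
use std::ptr;

#[repr(C)]
struct PaddingDemo {
    data: u8,
    force_pad: [usize; 0],
}

/// Packs the struct into a pointer-sized buffer without ever claiming
/// that the padding bytes form a valid integer.
fn pack(value: PaddingDemo) -> MaybeUninit<usize> {
    assert_eq!(size_of::<PaddingDemo>(), size_of::<usize>());
    assert_eq!(align_of::<PaddingDemo>(), align_of::<usize>());
    let mut packed = MaybeUninit::<usize>::zeroed();
    // SAFETY: the buffer is valid for writes and has the size and
    // alignment of PaddingDemo (checked above).
    unsafe { ptr::write(packed.as_mut_ptr() as *mut PaddingDemo, value) };
    packed
}

/// Reads the struct back out of the buffer.
fn unpack(packed: MaybeUninit<usize>) -> PaddingDemo {
    // SAFETY: assumes `packed` was produced by `pack`, so the data byte
    // is initialized; the padding is never interpreted as a value.
    unsafe { ptr::read(packed.as_ptr() as *const PaddingDemo) }
}
```

The buffer type is still pointer-sized, but it makes no promise that all of its bytes are initialized, which is the property the integer type cannot offer here.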

Rust pointer being freed was not allocated error

Here's the situation, I want to do some data conversion from a string, and for convenience, I converted it to a pointer in the middle, and now I want to return the part of the string, but I'm stuck with this exception:
foo(74363,0x10fd2fdc0) malloc: *** error for object 0x7ff65ff000d1: pointer being freed was not allocated
foo(74363,0x10fd2fdc0) malloc: *** set a breakpoint in malloc_error_break to debug
When I try to debug the program, I get the error message shown above.
Here's my sample code:
fn main() {
    unsafe {
        let mut s = String::from_utf8_unchecked(vec![97, 98]);
        let p = s.as_ptr();
        let k = p.add(1);
        String::from_raw_parts(k as *mut u8, 1, 1);
    }
}

You should never use an unsafe function without understanding its documentation, 100%.
So, what does the documentation of String::from_raw_parts say?
Safety
This is highly unsafe, due to the number of invariants that aren't
checked:
The memory at ptr needs to have been previously allocated by the same allocator the standard library uses, with a required alignment of exactly 1.
length needs to be less than or equal to capacity.
capacity needs to be the correct value.
Violating these may cause problems like corrupting the allocator's internal data structures.
The ownership of ptr is effectively transferred to the String which may then deallocate, reallocate or change the contents of memory pointed to by the pointer at will. Ensure that nothing else uses the pointer after calling this function.
There are two things that stand out here:
The memory at ptr needs to have been previously allocated.
capacity needs to be the correct value.
And those are related to how allocations work in Rust. Essentially, deallocation only expects the very pointer value (and type) that allocation returned.
Shenanigans such as trying to deallocate a pointer pointing in the middle of an allocation, with a different alignment, or with a different size, are Not Allowed.
Furthermore, you also missed:
Ensure that nothing else uses the pointer after calling this function.
Here, the original instance of String is still owning the allocation, and you are trying to deallocate one byte out of it. It cannot ever go well.
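A hedged sketch of what would be allowed instead, with made-up function names. tail_from copies the wanted part into a new String using safe slicing; round_trip shows the only pointer, length, and capacity that may be handed to from_raw_parts, namely the ones taken from a String that has given up ownership of its buffer.

```rust
use std::mem::ManuallyDrop;

/// Safe way to get "the part of the string": copy the tail into a new String.
/// Panics if `start` is not on a character boundary.
fn tail_from(s: &str, start: usize) -> String {
    s[start..].to_owned()
}

/// Takes a String apart and rebuilds it from the same three values.
fn round_trip(s: String) -> String {
    // ManuallyDrop stops the original String from freeing the buffer.
    let mut s = ManuallyDrop::new(s);
    let (ptr, len, cap) = (s.as_mut_ptr(), s.len(), s.capacity());
    // SAFETY: ptr, len and cap come unchanged from a live String whose
    // destructor will not run, so ownership moves to the new String.
    unsafe { String::from_raw_parts(ptr, len, cap) }
}
```

With the input from the question, tail_from("ab", 1) gives "b" without any unsafe code.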

Does Iterator::collect allocate the same amount of memory as String::with_capacity?

In C++ when joining a bunch of strings (where each element's size is known roughly), it's common to pre-allocate memory to avoid multiple re-allocations and moves:
std::vector<std::string> words;
constexpr size_t APPROX_SIZE = 20;
std::string phrase;
phrase.reserve((words.size() + 5) * APPROX_SIZE); // <-- avoid multiple allocations
for (const auto &w : words)
phrase.append(w);
Similarly, I did this in Rust (this chunk needs the unicode-segmentation crate):
use unicode_segmentation::UnicodeSegmentation;
fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    for gc in input.graphemes(true /*extended*/).rev() {
        result.push_str(gc);
    }
    result
}
I was told that the idiomatic way of doing it is a single expression
fn reverse(input: &str) -> String {
    input
        .graphemes(true /*extended*/)
        .rev()
        .collect::<Vec<&str>>()
        .concat()
}
While I really like it and want to use it, from a memory allocation point of view, would the former allocate fewer chunks than the latter?
I disassembled this with cargo rustc --release -- --emit asm -C "llvm-args=-x86-asm-syntax=intel" but it doesn't have source code interspersed, so I'm at a loss.

Your original code is fine and I do not recommend changing it.
The original version allocates once: inside String::with_capacity.
The second version allocates at least twice: first, it creates a Vec<&str> and grows it by pushing &strs onto it. Then, it counts the total size of all the &strs and creates a new String with the correct size. (The code for this is in the join_generic_copy method in str.rs.) This is bad for several reasons:
It allocates unnecessarily, obviously.
Grapheme clusters can be arbitrarily large, so the intermediate Vec can't be usefully sized in advance -- it just starts at size 1 and grows from there.
For typical strings, it allocates way more space than would actually be needed just to store the end result, because &str is usually 16 bytes in size while a UTF-8 grapheme cluster is typically much less than that.
It wastes time iterating over the intermediate Vec to get the final size where you could just take it from the original &str.
On top of all this, I wouldn't even consider this version idiomatic, because it collects into a temporary Vec in order to iterate over it, instead of just collecting the original iterator, as you had in an earlier version of your answer. This version fixes problem #3 and makes #4 irrelevant but doesn't satisfactorily address #2:
input.graphemes(true).rev().collect()
collect uses FromIterator for String, which will try to use the lower bound of the size_hint from the Iterator implementation for Graphemes. However, as I mentioned earlier, extended grapheme clusters can be arbitrarily long, so the lower bound can't be any greater than 1. Worse, &strs may be empty, so FromIterator<&str> for String doesn't know anything about the size of the result in bytes. This code just creates an empty String and calls push_str on it repeatedly.
Which, to be clear, is not bad! String has a growth strategy that guarantees amortized O(1) insertion, so if you have mostly tiny strings that won't need to be reallocated often, or you don't believe the cost of allocation is a bottleneck, using collect::<String>() here may be justified if you find it more readable and easier to reason about.
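One rough way to observe the difference without reading assembly is to watch the capacity while pieces are appended. This is only a sketch: count_capacity_changes is a hypothetical helper, and a change in capacity is used as a stand-in for a reallocation.

```rust
/// Counts how often the String's capacity changes while `pieces`
/// are appended one by one.
fn count_capacity_changes(mut buf: String, pieces: &[&str]) -> usize {
    let mut changes = 0;
    let mut cap = buf.capacity();
    for piece in pieces {
        buf.push_str(piece);
        if buf.capacity() != cap {
            changes += 1;
            cap = buf.capacity();
        }
    }
    changes
}
```

Starting from String::with_capacity(total_len) the count stays at zero, while starting from String::new() it is at least one, since an empty String has no buffer yet. How many times it grows after that depends on the growth strategy of the standard library.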
Let's go back to your original code.
let mut result = String::with_capacity(input.len());
for gc in input.graphemes(true).rev() {
    result.push_str(gc);
}
This is idiomatic. collect is also idiomatic, but all collect does is basically the above, with a less accurate initial capacity. Since collect doesn't do what you want, it's not unidiomatic to write the code yourself.
There is a slightly more concise, iterator-y version that still makes only one allocation. Use the extend method, which is part of Extend<&str> for String:
fn reverse(input: &str) -> String {
    let mut result = String::with_capacity(input.len());
    result.extend(input.graphemes(true).rev());
    result
}
I have a vague feeling that extend is nicer, but both of these are perfectly idiomatic ways of writing the same code. You should not rewrite it to use collect, unless you feel that expresses the intent better and you don't care about the extra allocation.
Related
Efficiency of flattening and collecting slices

How can I convert a Vec<T> into a C-friendly *mut T?

I have a Rust library that returns a u8 array to a C caller via FFI. The library also handles dropping the array after the client is done with it. The library has no state, so the client needs to own the array until it is passed back to the library for freeing.
Using Box::from_raw and Box::into_raw would be nice, but I couldn't manage to work out how to convert the array into the return type.

A Vec<T> is described by 3 values:
A pointer to its first element, that can be obtained with .as_mut_ptr()
A length, that can be obtained with .len()
A capacity, that can be obtained with .capacity()
In terms of a C array, the capacity is the size of the allocated memory, while the length is the number of elements actually contained in the array. Both are counted in units of T. You would normally need to provide all 3 values to your C code.
If you want length and capacity to be equal, you can call .shrink_to_fit() on the vector, which reduces its capacity to as close to its length as the allocator allows.
If you hand ownership of the Vec<T> over to your C code, don't forget to call std::mem::forget(v) on it once you have retrieved the 3 values described above, so that its destructor does not run at the end of the function.
Afterwards, you can create back a Vec from these 3 values using from_raw_parts(..) like this:
let v = unsafe { Vec::<T>::from_raw_parts(ptr, length, capacity) };
When that Vec's destructor runs, the memory will be correctly freed. Be careful: all 3 values need to be correct for the deallocation to be correct. The length matters less for a Vec<u8>, but in general the destructor of Vec runs the destructor of every element it contains, according to its length.
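Put together, the round trip could look roughly like the sketch below. RawBuf, vec_into_raw and vec_from_raw are names invented for this example, and the sketch uses std::mem::ManuallyDrop in place of std::mem::forget to keep the destructor from running. A real FFI layer would additionally expose these through extern "C" functions.

```rust
use std::mem::ManuallyDrop;

/// The three values that describe a Vec<u8>, in a C-compatible layout.
#[repr(C)]
pub struct RawBuf {
    pub ptr: *mut u8,
    pub len: usize,
    pub cap: usize,
}

/// Gives up ownership of the Vec and returns its raw parts.
pub fn vec_into_raw(v: Vec<u8>) -> RawBuf {
    // ManuallyDrop prevents the Vec's destructor from freeing the buffer.
    let mut v = ManuallyDrop::new(v);
    RawBuf { ptr: v.as_mut_ptr(), len: v.len(), cap: v.capacity() }
}

/// Rebuilds the Vec so that dropping it frees the memory.
///
/// # Safety
/// `raw` must come unchanged from `vec_into_raw` and be passed back only once.
pub unsafe fn vec_from_raw(raw: RawBuf) -> Vec<u8> {
    // SAFETY: guaranteed by the caller, see the contract above.
    unsafe { Vec::from_raw_parts(raw.ptr, raw.len, raw.cap) }
}
```

The C side may read and write the len bytes behind ptr, but it has to hand back all three values untouched when it is done.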

How does a string buffer help avoid wasting resources?

Basically, the usual way of adding to and handling strings is being replaced with other approaches because it causes confusion and wastes resources.
I agree, but I want to know what exactly causes this waste of resources, as described here:
'..and you’re running on an implementation that doesn’t have
sophisticated code for handling strings you can end up doing a lot of
wasted allocations...'
Link
How does the string buffer approach avoid this waste?

My explanation of this is coming from a Java / .NET background, however the same logic applies.
1. You must learn the concept of mutable and immutable objects...
Fixed-size values such as an Int32 / int variable can be changed in their current memory location after creation. This is because regardless of the value, its size in memory does not need to change.
Strings are immutable objects which means that once they are allocated they cannot be changed in their current memory location. This is because by nature a string can be of arbitrary length, and therefore, every time the string changes length, the system/runtime must find a new location in memory to store the string.
2. Concatenation vs. StringBuilder / StringBuffer
Since strings are immutable, every concatenation forces reallocation of memory. Let's assume the following example uses ASCII encoding (1 byte per char):
var message = "Hello World";
At this point, the system has allocated 11 bytes of memory to store your string.
message += "Hello Universe";
At this point, the system must allocate a new 25-byte block (11 + 14) and copy both parts into it. Your existing 11 bytes of memory can no longer store your new string!
Why "sophisticated code for handling strings" (StringBuffer / StringBuilder) helps you!
Every time you append a string to the buffer/builder, it stores that string once and keeps track of where it is in memory. The next string you append goes into a new location, without affecting the last one. Once you have finished building your string, the buffer/builder concatenates everything in one pass into a single string. String allocation is therefore vastly reduced, because you are not reallocating the whole string every time you append something to your buffer/builder!
Example:
StringBuilder builder = new StringBuilder();
builder.Append("Hello World");
At this point the builder has allocated 11 bytes, and leaves that allocation as-is!
builder.Append("Hello Universe");
At this point, the builder allocated another 14 bytes, leaving the last string intact.
builder.ToString();
At this point the builder concatenates all the strings in memory into one single string!
Summary:
Concatenation is a waste of resources because:
The system/runtime must clean out old, de-referenced memory locations, which takes some CPU time. In Java/.NET it's called garbage collection.
Every re-allocation of memory is a waste, until the garbage collector can go and clean it out!
Therefore, concatenation reduces performance of CPU and memory usage!
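The same contrast can be sketched in Rust, the language used in the other examples on this page. It is an analogy, not the Java/.NET implementation: concat_naive and concat_buffered are made-up names, and Rust's growable String stands in for the buffer/builder.

```rust
/// Builds a brand-new String on every step, copying everything
/// accumulated so far (the "concatenation" pattern).
fn concat_naive(parts: &[&str]) -> String {
    let mut message = String::new();
    for part in parts {
        // format! allocates a fresh String; the old one is then dropped.
        message = format!("{}{}", message, part);
    }
    message
}

/// Appends into one buffer that was sized up front
/// (the "buffer/builder" pattern).
fn concat_buffered(parts: &[&str]) -> String {
    let total: usize = parts.iter().map(|part| part.len()).sum();
    let mut buffer = String::with_capacity(total);
    for part in parts {
        buffer.push_str(part); // no reallocation: the capacity is sufficient
    }
    buffer
}
```

Both functions return the same text. The first allocates once per appended part, while the second allocates once in total.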
