Intro:
I'm curious about the performance difference (both cpu and memory usage) of storing small numbers as bitpacked unsigned integers versus vectors of bytes
Example
I'll use the example of storing RGBA values. They're 4 Bytes so it is very tempting to store them as a u32.
However, it would be more readable to store them as a vector of type u8.
As a more detailed example, say I want to store and retrieve the color rgba(255,0,0,255)
This is how I would go about doing the two methods:
// Bitpacked:
let i: u32 = 4278190335;
//binary is 11111111 00000000 00000000 11111111
//In reality I would most likely do something more similar to:
let i: u32 = 255 << 24 + 255; //i think this syntax is right
// Vector:
let v: Vec<u8> = [255,0,0,255];
Then the two red values could be queried with
i >> 24
//or
&v[0]
//both expressions evaluate to 255 (i think. I'm really new to rust <3 )
Question 1
As far as I know, the values of v must be stored on the heap and so there are the performance costs that are associated with that. Are these costs significant enough to make bit packing worth it?
Question 2
Then there's the two expressions i >> 24 and &v[0]. I don't know how fast rust is at bit shifting versus getting values off the heap. I'd test it but I won't have access to a machine with rust installed for a while. Are there any immediate insights someone could give on the drawbacks of these two operations?
Question 3
Finally, is the difference in memory usage as simple as just storing 32 bits on the stack for the u32 versus storing 64 bits on the stack for the pointer v as well as 32 bits on the heap for the values of v?
Sorry if this question is a bit confusing
Using a Vec will be more expensive; as you mentioned, it will need to perform heap allocations, and access will be bounds-checked as well.
That said, if you use an array [u8; 4] instead, the performance compared with a bitpacked u32 representation should be almost identical.
In fact, consider the following simple example:
pub fn get_red_bitpacked(i: u32) -> u8 {
(i >> 24) as u8
}
pub fn get_red_array(v: [u8; 4]) -> u8 {
v[3]
}
pub fn test_bits(colour: u8) -> u8 {
let colour = colour as u32;
let i = (colour << 24) + colour;
get_red_bitpacked(i)
}
pub fn test_arr(colour: u8) -> u8 {
let v = [colour, 0, 0, colour];
get_red_array(v)
}
I took a look on Compiler Explorer, and the compiler decided that get_red_bitpacked and get_red_array were completely identical: so much so it didn't even bother generating code for the former. The two "test" functions obviously optimised to the exact same assembly as well.
example::get_red_array:
mov eax, edi
shr eax, 24
ret
example::test_bits:
mov eax, edi
ret
example::test_arr:
mov eax, edi
ret
Obviously this example was seen through by the compiler: for a proper comparison you should benchmark with actual code. That said, I feel fairly safe in saying that with Rust the performance of u32 versus [u8; 4] for these kinds of operations should be identical in general.
tl;dr use a struct:
struct Color {
r: u8,
g: u8,
b: u8,
a: u8,
}
Maybe use repr(packed) as well.
It gives you the best of all worlds and you can give the channels their name.
Are these costs significant enough to make bit packing worth it?
Heap allocation has a huge cost.
Are there any immediate insights someone could give on the drawbacks of these two operations?
Both are noise compared to allocating memory.
let vector: Vec<u8> = Vec::new();
Can the vector in the code above cause a stackoverflow if the vector grows too big?
let vector: Vec<Box<u8>> = Vec::new();
How bout this one? Since its elements are on the heap.
let vector: Box<Vec<u8>> = Box::new(Vec::new());
I'm assuming that in the code above no stackoverflow should be possible, am i correct?
No the actual data is on heap. So there will not be stack overflow.
What is on stack is capacity, length and the pointer to the actual data on heap. If regrowth is required then it is done on the heap. If it is moved (not cloned) then what is copied is just the length, capacity and pointer to data (bitwise shallow copy).
Not the actual implementation but if you have to implement Vector then you will start with:
pub struct Vec<T> {
ptr: Unique<T>,
cap: usize,
len: usize,
}
You see the ptr is actually pointing to heap location where the data is. The vector on stack will consists of just few fields like the 3 mentioned above.
You cannot grow an object on stack, as you push objects on stack frame if any object is allowed to grow it will run over other objects. On heap for growth, if contiguous memory is not available, then entire data is moved to another place with newer capacity; if contiguous memory block is available, then growth in capacity is instant.
When I create a vector, the length and the capacity are the same. What is the difference between these methods?
fn main() {
let vec = vec![1, 2, 3, 4, 5];
println!("Length: {}", vec.len()); // Length: 5
println!("Capacity: {}", vec.capacity()); // Capacity: 5
}
Growable vectors reserve space for future additions, hence the difference between allocated space (capacity) and actually used space (length).
This is explained in the standard library's documentation for Vec:
The capacity of a vector is the amount of space allocated for any future elements that will be added onto the vector. This is not to be confused with the length of a vector, which specifies the number of actual elements within the vector. If a vector's length exceeds its capacity, its capacity will automatically be increased, but its elements will have to be reallocated.
For example, a vector with capacity 10 and length 0 would be an empty vector with space for 10 more elements. Pushing 10 or fewer elements onto the vector will not change its capacity or cause reallocation to occur. However, if the vector's length is increased to 11, it will have to reallocate, which can be slow. For this reason, it is recommended to use Vec::with_capacity whenever possible to specify how big the vector is expected to get.
len() returns the number of elements in the vector (i.e., the vector's length). In the example below, vec contains 5 elements, so len() returns 5.
capacity() returns the number of elements the vector can hold (without reallocating memory). In the example below, vec can hold 105 elements, since we use reserve() to allocate at least 100 slots in addition to the original 5 (more might be allocated in order to minimize the number of allocations).
fn main() {
let mut vec = vec![1, 2, 3, 4, 5];
vec.reserve(100);
assert!(vec.len() == 5);
assert!(vec.capacity() >= 105);
}
Are variables of type Vec<[f3; 5]> stored as one contiguous array (of Vec::len() * 5 * sizeof(f32) bytes) or is it stored as a Vec of pointers?
The contents of a Vec<T> is, regardless of T, a single heap allocation, of self.capacity() * std::mem::size_of::<T>() bytes. (Vec overallocates—that’s the whole point of Vec<T> instead of Box<[T]>—so it’s the capacity, not the length, that matter in this calculation.) The actual Vec<T> itself takes three words (24 bytes on a 64-bit machine).
[f32; 5] is just a chunk of memory containing five 32-bit floating-point numbers, with no indirection; this comes to twenty bytes (hence std::mem::size_of::<[f32; 5]>() == 20).
I have a vector of u8 that I want to interpret as a vector of u32. It is assumed that the bytes are in the right order. I don't want to allocate new memory and copy bytes after casting. I got the following to work:
use std::mem;
fn reinterpret(mut v: Vec<u8>) -> Option<Vec<u32>> {
let v_len = v.len();
v.shrink_to_fit();
if v_len % 4 != 0 {
None
} else {
let v_cap = v.capacity();
let v_ptr = v.as_mut_ptr();
println!("{:?}|{:?}|{:?}", v_len, v_cap, v_ptr);
let v_reinterpret = unsafe { Vec::from_raw_parts(v_ptr as *mut u32, v_len / 4, v_cap / 4) };
println!("{:?}|{:?}|{:?}",
v_reinterpret.len(),
v_reinterpret.capacity(),
v_reinterpret.as_ptr());
println!("{:?}", v_reinterpret);
println!("{:?}", v); // v is still alive, but is same as rebuilt
mem::forget(v);
Some(v_reinterpret)
}
}
fn main() {
let mut v: Vec<u8> = vec![1, 1, 1, 1, 1, 1, 1, 1];
let test = reinterpret(v);
println!("{:?}", test);
}
However, there's an obvious problem here. From the shrink_to_fit documentation:
It will drop down as close as possible to the length but the allocator may still inform the vector that there is space for a few more elements.
Does this mean that my capacity may still not be a multiple of the size of u32 after calling shrink_to_fit? If in from_raw_parts I set capacity to v_len/4 with v.capacity() not an exact multiple of 4, do I leak those 1-3 bytes, or will they go back into the memory pool because of mem::forget on v?
Is there any other problem I am overlooking here?
I think moving v into reinterpret guarantees that it's not accessible from that point on, so there's only one owner from the mem::forget(v) call onwards.
This is an old question, and it looks like it has a working solution in the comments. I've just written up what exactly goes wrong here, and some solutions that one might create/use in today's Rust.
This is undefined behavior
Vec::from_raw_parts is an unsafe function, and thus you must satisfy its invariants, or you invoke undefined behavior.
Quoting from the documentation for Vec::from_raw_parts:
ptr needs to have been previously allocated via String/Vec (at least, it's highly likely to be incorrect if it wasn't).
T needs to have the same size and alignment as what ptr was allocated with. (T having a less strict alignment is not sufficient, the alignment really needs to be equal to satsify the dealloc requirement that memory must be allocated and deallocated with the same layout.)
length needs to be less than or equal to capacity.
capacity needs to be the capacity that the pointer was allocated with.
So, to answer your question, if capacity is not equal to the capacity of the original vec, then you've broken this invariant. This gives you undefined behavior.
Note that the requirement isn't on size_of::<T>() * capacity either, though, which brings us to the next topic.
Is there any other problem I am overlooking here?
Three things.
First, the function as written is disregarding another requirement of from_raw_parts. Specifically, T must have the same size as alignment as the original T. u32 is four times as big as u8, so this again breaks this requirement. Even if capacity*size remains the same, size isn't, and capacity isn't. This function will never be sound as implemented.
Second, even if all of the above was valid, you've also ignored the alignment. u32 must be aligned to 4-byte boundaries, while a Vec<u8> is only guaranteed to be aligned to a 1-byte boundary.
A comment on the OP mentions:
I think on x86_64, misalignment will have performance penalty
It's worth noting that while this may be true of machine language, it is not true for Rust. The rust reference explicitly states "A value of alignment n must only be stored at an address that is a multiple of n." This is a hard requirement.
Why the exact type requirement?
Vec::from_raw_parts seems like it's pretty strict, and that's for a reason. In Rust, the allocator API operates not only on allocation size, but on a Layout, which is the combination of size, number of things, and alignment of individual elements. In C with memalloc, all the allocator can rely upon is that the size is the same, and some minimum alignment. In Rust, though, it's allowed to rely on the entire Layout, and invoke undefined behavior if not.
So in order to correctly deallocate the memory, Vec needs to know the exact type that it was allocated with. By converting a Vec<u32> into Vec<u8>, it no longer knows this information, and so it can no longer properly deallocate this memory.
Alternative - Transforming slices
Vec::from_raw_parts's strictness comes from the fact that it needs to deallocate the memory. If we create a borrowing slice, &[u32] instead, we no longer need to deal with it! There is no capacity when turning a &[u8] into &[u32], so we should be all good, right?
Well, almost. You still have to deal with alignment. Primitives are generally aligned to their size, so a [u8] is only guaranteed to be aligned to 1-byte boundaries, while [u32] must be aligned to a 4-byte boundary.
If you want to chance it, though, and create a [u32] if possible, there's a function for that - <[T]>::align_to:
pub unsafe fn align_to<U>(&self) -> (&[T], &[U], &[T])
This will trim of any starting and ending misaligned values, and then give you a slice in the middle of your new type. It's unsafe, but the only invariant you need to satisfy is that the elements in the middle slice are valid.
It's sound to reinterpret 4 u8 values as a u32 value, so we're good.
Putting it all together, a sound version of the original function would look like this. This operates on borrowed rather than owned values, but given that reinterpreting an owned Vec is instant-undefined-behavior in any case, I think it's safe to say this is the closest sound function:
use std::mem;
fn reinterpret(v: &[u8]) -> Option<&[u32]> {
let (trimmed_front, u32s, trimmed_back) = unsafe { v.align_to::<u32>() };
if trimmed_front.is_empty() && trimmed_back.is_empty() {
Some(u32s)
} else {
// either alignment % 4 != 0 or len % 4 != 0, so we can't do this op
None
}
}
fn main() {
let mut v: Vec<u8> = vec![1, 1, 1, 1, 1, 1, 1, 1];
let test = reinterpret(&v);
println!("{:?}", test);
}
As a note, this could also be done with std::slice::from_raw_parts rather than align_to. However, that requires manually dealing with the alignment, and all it really gives is more things we need to ensure we're doing right. Well, that and compatibility with older compilers - align_to was introduced in 2018 in Rust 1.30.0, and wouldn't have existed when this question was asked.
Alternative - Copying
If you do need a Vec<u32> for long term data storage, I think the best option is to just allocate new memory. The old memory is allocated for u8s anyways, and wouldn't work.
This can be made fairly simple with some functional programming:
fn reinterpret(v: &[u8]) -> Option<Vec<u32>> {
let v_len = v.len();
if v_len % 4 != 0 {
None
} else {
let result = v
.chunks_exact(4)
.map(|chunk: &[u8]| -> u32 {
let chunk: [u8; 4] = chunk.try_into().unwrap();
let value = u32::from_ne_bytes(chunk);
value
})
.collect();
Some(result)
}
}
First, we use <[T]>::chunks_exact to iterate over chunks of 4 u8s. Next, try_into to convert from &[u8] to [u8; 4]. The &[u8] is guaranteed to be length 4, so this never fails.
We use u32::from_ne_bytes to convert the bytes into a u32 using native endianness. If interacting with a network protocol, or on-disk serialization, then using from_be_bytes or from_le_bytes may be preferable. And finally, we collect to turn our result back into a Vec<u32>.
As a last note, a truly general solution might use both of these techniques. If we change the return type to Cow<'_, [u32]>, we could return aligned, borrowed data if it works, and allocate a new array if it doesn't! Not quite the best of both worlds, but close.