How is a String pointer allocated on the stack? (Rust)

fn main() {
    let start = 1;             // 0x16fdda220
    let a = 1;                 // 0x16fdda224
    let s = String::from("x"); // 0x16fdda228
    let ss = &s;               // 0x16fdda228
    let b = 1;                 // 0x16fdda24c
    println!("{:p} {:p} {:p} {:p} {:p}", &start, &a, &s, &b, ss);
}
I have commented the printed addresses on the right. My questions are:
Why is the difference between start and a 4 bytes? I expected it to be 8 bytes, because I'm on a 64-bit MacBook.
Why is the difference between b and s 36 bytes? I expected it to be 32 bytes: 8 bytes for String's internal buffer pointer, 8 bytes for String's length, 8 bytes for String's capacity, and 8 bytes for ss.

Why is the difference between start and a 4 bytes? I expected it to be 8 bytes, because I'm on a 64-bit MacBook.
Since no type was specified, the integers default to i32. If you want a 64-bit type, you need to specify it explicitly; this way they resolve to the same type on every system. If you want an integer whose size depends on the word size, you can use usize or isize.
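A quick way to see this, using std::mem::size_of_val (a minimal sketch; only the sizes matter, not the values):

use std::mem::size_of_val;

fn main() {
    let a = 1;      // unannotated literal: defaults to i32
    let b = 1i64;   // explicitly 64-bit
    let c = 1usize; // pointer-sized: 8 bytes on a 64-bit target
    println!("{} {} {}", size_of_val(&a), size_of_val(&b), size_of_val(&c)); // prints: 4 8 8
}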
Why is the difference between b and s 36 bytes? I expected it to be 32 bytes: 8 bytes for String's internal buffer pointer, 8 bytes for String's length, 8 bytes for String's capacity, and 8 bytes for ss.
Internally, a String is just a Vec<u8>. The documentation describes how a Vec<u8> contains a pointer, a length, and a capacity, giving it a total size of 24 bytes in your case. The &String does indeed take up another 8 bytes for its pointer, leaving 4 unknown bytes directly before b.
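You can confirm the 24-byte and 8-byte figures directly with std::mem::size_of, without inspecting stack addresses (a quick sketch):

use std::mem::size_of;

fn main() {
    println!("{}", size_of::<String>());  // 24 on 64-bit: pointer + length + capacity
    println!("{}", size_of::<&String>()); // 8: a reference is a single pointer
}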
While I am not entirely sure, I am guessing that the gap is a side product of how you called String::from. The compiler may have determined that String::from required some space on the stack to return its result and shifted b over to make room, assuming the space would be initialized before the call.
Either way, it doesn't tell us much, since this was with optimizations disabled: the compiler leaves gaps because it never attempted to remove them in the first place. If you run this snippet in release mode (enabled via --release), the gaps disappear entirely.
Edit: Rust also currently has an issue with not re-using the stack space of moved objects. While that's somewhat unlikely to matter for this use case, you can see the tracking issue here: https://github.com/rust-lang/rust/issues/85230

Why is the difference between start and a 4 bytes? I expected it to be 8 bytes, because I'm on a 64-bit MacBook.
Because the default integer type is i32, and it occupies 4 bytes. You can specify start as 1usize and then it may be eight bytes, but note that nothing is guaranteed about stack layout.
Why is the difference between b and s 36 bytes? I expected it to be 32 bytes: 8 bytes for String's internal buffer pointer, 8 bytes for String's length, 8 bytes for String's capacity, and 8 bytes for ss.
Again, nothing is guaranteed about stack layout.

Why is the difference between start and a 4 bytes? I expected it to be 8 bytes, because I'm on a 64-bit MacBook.
The default type for integers in Rust, when unconstrained by other requirements, is i32 (4 bytes). start and a are therefore both 4-byte signed integers.
Why is the difference between b and s 36 bytes? I expected it to be 32 bytes: 8 bytes for String's internal buffer pointer, 8 bytes for String's length, 8 bytes for String's capacity, and 8 bytes for ss.
In debug mode, on my system the difference is 24 bytes, not 36. Likely, ss is being optimized away.
In release mode, the difference is 4 bytes, hinting that the compiler was able to eliminate the entire heap allocation for the string (since it's never used) and therefore eliminate most of String's data members as well.
Basically, there's no point in trying to reason about the layout of the stack. The compiler and optimizer have complete freedom to arrange things however they want as long as the observable behavior of the program isn't changed.

Related

Problem splitting regular ASCII symbols in a string

Just had this error pop up while messing around with some graphics for a terminal interface...
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '░' (bytes 1..4) of ░▒▓█', src/main.rs:38:6
Can I not use these characters, or do I need to work some magic to support what I thought were default ASCII characters?
(Here's the related code for those wondering.)
// Example call with the same parameters that led to this issue.
charlist(&" ░▒▓█".to_string(), 0.66);

// Returns the n-th character in a string.
// (Where n is a float value from 0 to 1,
// 0 being the start of the string and 1 the end.)
fn charlist<'a>(chars: &'a String, amount: f64) -> &'a str {
    let chl: f64 = chars.chars().count() as f64; // length of the string in characters
    let chpos = ((amount * chl) % chl) as i32;   // scalar converted to an integer position
    &chars[chpos as usize..chpos as usize + 1]   // slice out the single requested character (panics here)
}
There are a couple of misconceptions you seem to have, so let me address them in order.
░, ▒, ▓ and █ are not ASCII characters! They are Unicode code points. You can verify this with the following simple experiment:
fn main() {
    let slice = " ░▒▓█";
    for c in slice.chars() {
        println!("{}, {}", c, c.len_utf8());
    }
}
This code has output:
, 1
░, 3
▒, 3
▓, 3
█, 3
As you can see, these "boxy" characters are 3 bytes long each! Rust uses UTF-8 encoding for all of its strings. This leads to another misconception.
In the line &chars[chpos as usize..chpos as usize + 1] you are trying to take a slice one byte in length. String slices in Rust are indexed by byte, and you tried to slice into the middle of a character (each box character is 3 bytes long). In general, characters in UTF-8 encoding can be from one to four bytes long. To get a char's length in bytes you can use the len_utf8 method.
You can get an iterator over the characters of a string slice using the chars method. Getting the n-th character is then as easy as calling the iterator's nth method, so the following holds:
assert_eq!(" ░▒▓█".chars().nth(3).unwrap(), '▓');
If you also want the byte indices of these chars, you can use the char_indices method.
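For example, char_indices makes the byte offsets visible and shows why only some byte positions are valid slice boundaries (a small sketch):

fn main() {
    let s = " ░▒▓█";
    // char_indices yields (byte offset, char) pairs
    for (i, c) in s.char_indices() {
        println!("byte {}: {}", i, c);
    }
    // Slicing is only valid on char boundaries:
    assert_eq!(&s[1..4], "░"); // bytes 1..4 hold the 3-byte '░'
}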
Using f64 values to select the n-th character is odd, and I would encourage you to rethink whether you really want to do this. But if you do, you have two options.
You must remember that since characters have variable length, the slice method len doesn't return the number of characters but the slice's length in bytes. To know how many characters are in a string you have no option but to iterate over it. So if, for example, you want the middle character, you must first know how many characters there are. I can think of two ways to do this.
You can collect the characters into a Vec<char> (or something similar). Then you will know how many characters there are and can index the n-th one in O(1). However, this requires an additional memory allocation.
You can first count how many characters there are with slice.chars().count(). Then calculate the position of the n-th one and fetch it by iterating over the chars again (as I showed above). This won't allocate any additional memory, but you will have to iterate over the whole string twice.
Which one you pick is up to you. You will have to make a compromise.
This isn't really a correctness problem, but prefer &str over &String in function arguments, as it gives your callers more flexibility. Also, you don't have to specify the lifetime when there is only one reference among the arguments and another in the return type; Rust will infer that they must have the same lifetime.
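Putting this together, here is one possible rewrite of charlist along those lines (a sketch, not the only way to do it; it returns a char, assumes a non-empty input, and clamps the position to the valid range):

// Returns the character at relative position `amount` (0.0 to 1.0).
fn charlist(chars: &str, amount: f64) -> char {
    let chl = chars.chars().count(); // number of characters, not bytes
    let chpos = ((amount * chl as f64) as usize).min(chl.saturating_sub(1));
    chars.chars().nth(chpos).unwrap() // walk to the n-th character
}

fn main() {
    // The call from the question: 0.66 * 5 characters -> position 3 -> '▓'
    println!("{}", charlist(" ░▒▓█", 0.66));
}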

How to limit a string slice to a specific number of bytes

This might seem like a dumb question; however, I've been trying to restrict a string slice (that comes as a parameter to a function) to a certain number of bytes.
This is what I've tried
use std::str::FromStr;
use web3::types::H256;

fn padd_address(addr: &str) -> Result<H256, Box<dyn std::error::Error>> {
    if addr.strip_prefix("0x").unwrap().as_bytes().len() > std::mem::size_of::<[u8; 20]>() {
        return Err("address is too long".into());
    }
    let padded = &format!("0x{}", format!("{:0>64}", addr))[..];
    Ok(H256::from_str(padded)?)
}
Now assume I have an address like this
let addr = "0xdAC17F958D2ee523a2206206994597C13D831ec7";. The function first strips the 0x prefix and then computes the number of bytes in the string.
Now, if I println!("{:?}", addr.strip_prefix("0x").unwrap().as_bytes().len()), I get 40 bytes in return instead of 20, which is the actual size of a contract address.
Any thoughts on how I can check whether the string that comes in as a parameter, i.e. addr, is only 20 bytes?
I get 40 bytes in return instead of 20, which is the actual size of a contract address.
No, your address is
dAC17F958D2ee523a2206206994597C13D831ec7
which is in fact 40 bytes long. as_bytes() simply returns the byte data underlying the string, so it returns
b"dAC17F958D2ee523a2206206994597C13D831ec7"
It does not decode anything, which incidentally makes as_bytes() pure syntactic overhead here: as_bytes().len() is literally what str::len does.
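You can check this directly (a small sketch using the 40-hex-character address from the question):

fn main() {
    let addr = "dAC17F958D2ee523a2206206994597C13D831ec7";
    assert_eq!(addr.as_bytes().len(), addr.len()); // both 40: hex characters, not decoded bytes
}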
If you want the underlying value, you need something like the hex crate to decode your address from hexadecimal to binary (or hand-roll the same decoding), at which point you will have 20 bytes of data.
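For instance, a minimal sketch with the hex crate (the helper name check_address is just for illustration):

// Decode the hex string and check the decoded byte length.
fn check_address(addr: &str) -> bool {
    let hex_part = addr.strip_prefix("0x").unwrap_or(addr);
    match hex::decode(hex_part) {
        Ok(bytes) => bytes.len() == 20, // 20 bytes == 40 hex characters
        Err(_) => false,                // not valid hexadecimal
    }
}

fn main() {
    assert!(check_address("0xdAC17F958D2ee523a2206206994597C13D831ec7"));
}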

unsafe.Sizeof() says any string takes 16 bytes, but how? [duplicate]

This question already has an answer here: String memory usage in Golang
Simply running fmt.Println(unsafe.Sizeof("")) prints 16. Changing the content of the string doesn't affect the outcome.
Can someone explain where this number (16) comes from?
Strings in Go are represented by a reflect.StringHeader containing a pointer to the actual string data and the length of the string:
type StringHeader struct {
    Data uintptr
    Len  int
}
unsafe.Sizeof(s) only returns the size of the StringHeader struct, not the pointed-to data. So (on your 64-bit platform) it is the sum of 8 bytes for Data and 8 bytes for Len, making 16 bytes.
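For comparison, the same two-word layout shows up in other languages as well; in Rust, for instance, a &str is also a (pointer, length) pair:

fn main() {
    // A &str, like Go's string header, is a pointer plus a length:
    println!("{}", std::mem::size_of::<&str>()); // 16 on a 64-bit target
}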

How capacity of []rune is determined when converting from a string

Can someone explain why I get different capacities when converting the same string to []rune?
Take a look at this code
package main

import (
    "fmt"
)

func main() {
    input := "你好"
    runes := []rune(input)
    fmt.Printf("len %d\n", len(input))
    fmt.Printf("len %d\n", len(runes))
    fmt.Printf("cap %d\n", cap(runes))
    fmt.Println(runes[:3])
}
This prints:
len 6
len 2
cap 2
panic: runtime error: slice bounds out of range [:3] with capacity 2
But when commenting out the fmt.Println(runes[:3]) line, it prints:
len 6
len 2
cap 32
See how the []rune capacity changed from 2 to 32. How? Why?
If you want to test => Go playground
The capacity may be anything, as long as the resulting slice of the conversion contains the runes of the input string; that is the only thing the spec requires and guarantees. The compiler may decide to use a lower capacity if you pass the slice to fmt.Println(), as this signals that the slice may escape. Again, that decision is made by the compiler and is out of your hands.
Escape means the value may escape from the function, and as such, it must be allocated on the heap (and not on the stack), because the stack may get destroyed / overwritten once the function returns, and if the value "escapes" from the function, its memory area must be retained as long as there is a reference to the value. The Go compiler performs escape analysis, and if it can't prove a value does not escape the function it's declared in, the value will be allocated on the heap.
See related question: Calculating sha256 gives different results after appending slices depending on if I print out the slice before or not
The reason the string and the []rune give different results from len is that they count different things: len(string) returns the length in bytes (which may be more than the number of characters, for multi-byte characters), while len([]rune) returns the number of runes, which is generally the number of characters.
This blog post goes into detail how exactly Go treats text in various forms: https://blog.golang.org/strings
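For comparison, the same byte-count versus character-count distinction exists in Rust (a quick sketch with the string from the question):

fn main() {
    let input = "你好";
    let runes: Vec<char> = input.chars().collect();
    println!("len {}", input.len());  // 6: UTF-8 bytes (3 per character here)
    println!("len {}", runes.len());  // 2: number of characters
}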

Why does ToBase64String change a 16 byte string to 24 bytes

I have the following code. When I check the value of the variable i, it is 16, but when the output is converted to Base64 it is 24 bytes.
byte[] bytOut = ms.GetBuffer();
int i = 0;
for (i = 0; i < bytOut.Length; i++)
    if (bytOut[i] == 0)
        break;

// convert into Base64 so that the result can be used in xml
return System.Convert.ToBase64String(bytOut, 0, i);
Is this expected? I am trying to cut down storage and this is one of my problems.
Base64 expresses input made of 8-bit bytes using 64 human-readable characters, so each output character carries only 6 bits of information.
The key to the answer is that the encoding works in 24-bit chunks, so every 24 bits, or fraction thereof, produces 4 characters of output.
16 bytes * 8 bits = 128 bits of information
128 bits / 24 bits per chunk = 5.333 chunks
So the final output will be 6 chunks or 24 characters.
The fractional chunks are handled with equal signs, which represent the trailing "null bits". In your case, the output will always end in '=='.
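The arithmetic generalizes to any input length; here is a small sanity check of the formula (sketched in Rust rather than C#, since the rule is language-independent):

// Every 3 input bytes become 4 output characters; a partial final
// group is rounded up and padded with '=' signs.
fn base64_len(input_bytes: usize) -> usize {
    ((input_bytes + 2) / 3) * 4
}

fn main() {
    assert_eq!(base64_len(16), 24); // the case from the question
    assert_eq!(base64_len(15), 20); // 5 full groups, no padding
}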
Yes, you'd expect to see some expansion. You're representing your data in a base with only 64 characters, yet every possible byte value still needs a way to be encoded, so you end up with a slight expansion of the data.
Here's a link that explains how much: Base64: What is the worst possible increase in space usage?
Edit: Based on your comment above, if you need to reduce size, you should look at compressing the data before you encrypt. This will get you the max benefit from compression. Compressing encrypted binary does not work.
This is because a Base64 string may contain only 64 distinct characters (so that it remains displayable), whereas a byte has 256 possible values and can therefore carry more information.
Base64 is a great way to represent binary data in a string using only standard, printable characters. It is not, however, a good way to represent string data because it takes more characters than the original string.