This might seem like a dumb question, but I've been trying to restrict a string slice (that comes as a parameter to a function) to a certain number of bytes.
This is what I've tried:
fn padd_address(addr: &str) -> Result<web3::types::H256, Box<dyn std::error::Error>> {
    if addr.strip_prefix("0x").unwrap().as_bytes().len() > std::mem::size_of::<[u8; 20]>() {
        return Err("address is longer than 20 bytes".into());
    }
    let padded = format!("0x{}", format!("{:0>64}", addr));
    Ok(H256::from_str(&padded).unwrap())
}
Now assume I have an address like this:
let addr = "0xdAC17F958D2ee523a2206206994597C13D831ec7";
The function will first strip the 0x prefix and then compute the number of bytes in the remaining string.
Now if I were to println!("{:?}", addr.strip_prefix("0x").unwrap().as_bytes().len()), I get 40 bytes in return instead of 20, which is the actual size of a contract address.
Any thoughts on how I can check whether the string that comes as a parameter, i.e. addr, has 20 bytes only?
I get 40 bytes in return instead of 20 which is the actual size of a contract address.
No, your address is
dAC17F958D2ee523a2206206994597C13D831ec7
which is in fact 40 bytes. as_bytes() simply returns the bytes data underlying the string, so it returns
b"dAC17F958D2ee523a2206206994597C13D831ec7"
it does not decode anything. Incidentally, that makes as_bytes() useless syntactic overhead here, as it's literally what str::len does.
If you want the underlying value, you need something like the hex crate to decode your address from hexadecimal to binary (or hand-roll the same decoding), at which point you will have 20 bytes of data.
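The decoding can be sketched with only the standard library (the hex crate does the same job more conveniently); decode_hex here is an illustrative helper, not a standard API:

```rust
// Hand-rolled hex decoding, assuming an even-length ASCII hex string.
// The `hex` crate's `hex::decode` offers the same more conveniently.
fn decode_hex(s: &str) -> Result<Vec<u8>, std::num::ParseIntError> {
    (0..s.len())
        .step_by(2)
        .map(|i| u8::from_str_radix(&s[i..i + 2], 16))
        .collect()
}

fn main() {
    let addr = "0xdAC17F958D2ee523a2206206994597C13D831ec7";
    let stripped = addr.strip_prefix("0x").unwrap();
    assert_eq!(stripped.len(), 40); // 40 hex characters...
    let bytes = decode_hex(stripped).unwrap();
    assert_eq!(bytes.len(), 20); // ...decoding to 20 bytes of data
}
```

So checking the length of the decoded bytes, not the string, is the validation you actually want.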
Just had this error pop up while messing around with some graphics for a terminal interface...
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '░' (bytes 1..4) of ░▒▓█', src/main.rs:38:6
Can I not use these characters, or do I need to work some magic to support what I thought were default ASCII characters?
(Here's the related code for those wondering.)
// Example call with the same parameters that led to this issue.
charlist(&" ░▒▓█".to_string(), 0.66);
// Returns the n-th character in a string.
// (Where n is a float value from 0 to 1,
// 0 being the start of the string and 1 the end.)
fn charlist<'a>(chars: &'a String, amount: f64) -> &'a str {
    let chl: f64 = chars.chars().count() as f64; // Length of the string
    let chpos = -((amount * chl) % chl) as i32; // Scalar converted to an integer position in the string
    &chars[chpos as usize..chpos as usize + 1] // Slice the single requested character
}
There are a couple of misconceptions you seem to have, so let me address them in order.
░, ▒, ▓ and █ are not ASCII characters! They are Unicode code points. You can determine this with the following simple experiment.
fn main() {
    let slice = " ░▒▓█";
    for c in slice.chars() {
        println!("{}, {}", c, c.len_utf8());
    }
}
This code has output:
, 1
░, 3
▒, 3
▓, 3
█, 3
As you can see, these "boxy" characters have a length of 3 bytes each! Rust uses UTF-8 encoding for all of its strings. This leads to another misconception.
In this line, &chars[chpos as usize..chpos as usize+1], you are trying to get a slice one byte in length. String slices in Rust are indexed by bytes, but you tried to slice in the middle of a character (it has a length of 3 bytes). In general, characters in UTF-8 encoding can be from one to four bytes long. To get a char's length in bytes you can use the method len_utf8.
You can get an iterator over the characters of a string slice using the method chars. Then getting the n-th character is as easy as using the iterator method nth. So the following is true:
assert_eq!(" ░▒▓█".chars().nth(2).unwrap(), '▒');
If you also want the indices of these chars, you can use the method char_indices.
Using f64 values to address the n-th character is odd, and I would encourage you to rethink whether you really want to do this. But if you do, you have two options.
You must remember that, since characters have a variable length, the slice method len doesn't return the number of characters, but the slice's length in bytes. To know how many characters are in a string, you have no other option than to iterate over it. So if you, for example, want the middle character, you must first know how many there are. I can think of two ways to do this.
You can collect the characters into a Vec<char> (or something similar). Then you will know how many characters there are and can index the n-th one in O(1). However, this will result in an additional memory allocation.
You can first count how many characters there are with slice.chars().count(). Then calculate the position of the n-th one and get it by iterating over chars again (as I showed above). This won't result in any additional memory allocation, but it will iterate over the whole string twice (still O(n)).
Which one you pick is up to you. You will have to make a compromise.
This isn't really a correctness problem, but prefer &str over &String in function arguments (it will give your callers more flexibility). And you don't have to specify a lifetime when there is only one reference in the arguments and another in the return type; Rust will infer that they must have the same lifetime.
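Putting these points together, a corrected charlist might look like the sketch below. It assumes a non-empty string and amount in 0.0..=1.0, and returns a char rather than a sub-slice to avoid byte indexing entirely:

```rust
// A possible fix for `charlist`: map a 0.0..=1.0 scalar onto character
// positions by counting chars, never by slicing bytes. Assumes `chars`
// is non-empty and `amount` is in 0.0..=1.0.
fn charlist(chars: &str, amount: f64) -> char {
    let count = chars.chars().count();
    // Clamp the index so that amount == 1.0 still yields the last character.
    let idx = ((amount * count as f64) as usize).min(count - 1);
    chars.chars().nth(idx).unwrap()
}

fn main() {
    assert_eq!(charlist(" ░▒▓█", 0.0), ' ');
    assert_eq!(charlist(" ░▒▓█", 0.66), '▓');
    assert_eq!(charlist(" ░▒▓█", 1.0), '█');
}
```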
This isn't the exact use case, but it is basically what I am trying to do:
let mut username = "John_Smith";
println!("original username: {}",username);
username.set_char_at(4,'.'); // <------------- The part I don't know how to do
println!("new username: {}",username);
I can't figure out how to do this in constant time and using no additional space. I know I could use replace, but replace is O(n). I could make a vector of the characters, but that would require additional space.
I think you could create another variable that is a pointer using something like as_mut_slice, but this is deemed unsafe. Is there a safe way to replace a character in a string in constant time and space?
As of Rust 1.27 you can now use String::replace_range:
let mut username = String::from("John_Smith");
println!("original username: {}", username); // John_Smith
username.replace_range(4..5, ".");
println!("new username: {}", username); // John.Smith
replace_range won't work with &mut str. If the size of the range and the size of the replacement string aren't the same, it has to be able to resize the underlying String, so &mut String is required. But in the case you ask about (replacing a single-byte character with another single-byte character) its memory usage and time complexity are both O(1).
There is a similar method on Vec, Vec::splice. The primary difference between them is that splice returns an iterator that yields the removed items.
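A small sketch of that difference (the values here are arbitrary):

```rust
// Vec::splice replaces a range, like replace_range, but also yields the
// removed items as an iterator.
fn main() {
    let mut v = vec![1, 2, 3, 4, 5];
    let removed: Vec<i32> = v.splice(1..3, [20, 30, 40]).collect();
    assert_eq!(removed, [2, 3]); // the items that were taken out
    assert_eq!(v, [1, 20, 30, 40, 4, 5]); // the range was replaced
}
```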
In general? For any pair of characters? It's impossible.
A string is not an array. It may be implemented as an array, in some limited contexts.
Rust supports Unicode, which brings some challenges:
a Unicode code point is an integer between 0 and 0x10FFFF
a grapheme may be composed of multiple Unicode code points
In order to represent this, a Rust string is (for now) a UTF-8 byte sequence:
a single Unicode code point might be represented by 1 to 4 bytes
a grapheme might be represented by 1 or more bytes (no upper limit)
and therefore, the very notion of "replacing character i" brings a few challenges:
the byte position of character i is somewhere between index i and the end of the string; finding exactly where requires reading the string from the beginning, which is O(n)
switching the i-th character in-place for another requires that both characters take up exactly the same amount of bytes
In general? It's impossible.
In the particular and very specific case where the byte index is known and both characters have the same encoded length, it is doable by directly modifying the byte sequence returned by as_bytes_mut, which is duly marked unsafe since you may inadvertently corrupt the string (remember, this byte sequence must remain valid UTF-8).
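That specific case can be sketched as follows; set_ascii_byte is a hypothetical helper, not a standard API, and the unsafe block is sound only because both bytes involved are ASCII:

```rust
// Replaces a single byte in place. Sound only when both the old and the
// new byte are ASCII, so the buffer stays valid UTF-8. `set_ascii_byte`
// is a hypothetical helper, not a standard API.
fn set_ascii_byte(s: &mut String, i: usize, b: u8) {
    assert!(b.is_ascii() && s.as_bytes()[i].is_ascii());
    // SAFETY: one ASCII byte is swapped for another, preserving UTF-8.
    unsafe {
        s.as_bytes_mut()[i] = b;
    }
}

fn main() {
    let mut username = String::from("John_Smith");
    set_ascii_byte(&mut username, 4, b'.'); // O(1) time, no extra space
    assert_eq!(username, "John.Smith");
}
```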
If you want to handle only ASCII there is separate type for that:
use std::ascii::{AsciiCast, OwnedAsciiCast};
fn main() {
    let mut ascii = "ascii string".to_string().into_ascii();
    *ascii.get_mut(6) = 'S'.to_ascii();
    println!("result = {}", ascii);
}
There are some missing pieces (like into_ascii for &str) but it does what you want.
The current implementation of to_/into_ascii fails if the input string is not valid ASCII. There is to_ascii_opt (the old naming for methods that might fail), but it will probably be renamed to to_ascii in the future (and the failing method removed or renamed).
fn main() {
    let start = 1;             // 0x16fdda220
    let a = 1;                 // 0x16fdda224
    let s = String::from("x"); // 0x16fdda228
    let ss = &s;               // 0x16fdda228
    let b = 1;                 // 0x16fdda24c
    println!("{:p} {:p} {:p} {:p} {:p}", &start, &a, &s, &b, ss);
}
The printed addresses are shown in the comments on the right. My questions are:
Why is the difference between start and a 4 bytes? I expect it to be 8 bytes, because I'm on a 64-bit MacBook.
Why is the difference between b and s 36 bytes? I expect it to be 32 bytes: 8 bytes for the String's internal buffer pointer, 8 bytes for its length, 8 bytes for its capacity, and 8 bytes for ss.
Why is the difference between start and a 4 bytes? I expect it to be 8 bytes, because I'm on a 64-bit MacBook.
Since no type was specified, the integers default to i32. If you want a 64-bit type, you need to specify it explicitly; this way they come out the same on every system. If you want an integer whose size depends on the word size, you can use usize or isize.
Why is the difference between b and s 36 bytes? I expect it to be 32 bytes: 8 bytes for the String's internal buffer pointer, 8 bytes for its length, 8 bytes for its capacity, and 8 bytes for ss.
Internally, a String is just a Vec<u8>. The documentation describes how a Vec<u8> contains a pointer, a length, and a capacity, giving it a total size of 24 bytes in your case. An &String does indeed take up another 8 bytes for the pointer, leaving 4 unknown bytes directly before b.
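Those sizes can be checked directly with std::mem::size_of; a small sketch, assuming a 64-bit target:

```rust
use std::mem::{size_of, size_of_val};

fn main() {
    let start = 1; // unannotated integer literal: inferred as i32
    let big: i64 = 1;
    assert_eq!(size_of_val(&start), 4);
    assert_eq!(size_of_val(&big), 8);
    // String is a Vec<u8> header: pointer + length + capacity.
    assert_eq!(size_of::<String>(), 24);
    assert_eq!(size_of::<&String>(), 8); // a reference is one pointer
}
```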
While I am not entirely sure, I am guessing that this is a side product of how you called String::from. The compiler may have determined that String::from required some space on the stack to return its result and shifted b over to make room, assuming that it would be initialized before the call.
Either way, it doesn't really tell us much, since this was with optimization disabled: the compiler leaves gaps because it never attempted to remove them in the first place. If you run this snippet in release mode (enabled via --release), these gaps disappear completely.
Edit: Rust also currently has an issue with not re-using stack space of moved objects. While somewhat unlikely for this use case, you can see the tracking issue here: https://github.com/rust-lang/rust/issues/85230
Why is the difference between start and a 4 bytes? I expect it to be 8 bytes, because I'm on a 64-bit MacBook.
Because the default integer type is i32, and it occupies 4 bytes. You can specify start as 1usize and then it may be eight bytes, but note that nothing is guaranteed about stack layout.
Why is the difference between b and s 36 bytes? I expect it to be 32 bytes: 8 bytes for the String's internal buffer pointer, 8 bytes for its length, 8 bytes for its capacity, and 8 bytes for ss.
nothing is guaranteed about stack layout.
Why is the difference between start and a 4 bytes? I expect it to be 8 bytes, because I'm on a 64-bit MacBook.
The default type for integers in Rust, when unconstrained by other requirements, is i32 (4 bytes). start and a are therefore both 4-byte signed integers.
Why is the difference between b and s 36 bytes? I expect it to be 32 bytes: 8 bytes for the String's internal buffer pointer, 8 bytes for its length, 8 bytes for its capacity, and 8 bytes for ss.
In debug mode, on my system the difference is 24 bytes, not 36. Likely, ss is being optimized away.
In release mode, the difference is 4 bytes, hinting that the compiler was able to eliminate the entire heap allocation for the string (since it's never used) and therefore eliminate most of String's data members as well.
Basically, there's no point in trying to reason about the layout of the stack. The compiler and optimizer have complete freedom to arrange things however they want as long as the observable behavior of the program isn't changed.
Can someone explain why I get a different capacity when converting the same string to []rune?
Take a look at this code:
package main

import (
    "fmt"
)

func main() {
    input := "你好"
    runes := []rune(input)
    fmt.Printf("len %d\n", len(input))
    fmt.Printf("len %d\n", len(runes))
    fmt.Printf("cap %d\n", cap(runes))
    fmt.Println(runes[:3])
}
Which returns:
len 6
len 2
cap 2
panic: runtime error: slice bounds out of range [:3] with capacity 2
But when commenting out the fmt.Println(runes[:3]) it returns:
len 6
len 2
cap 32
See how the []rune capacity changed from 2 to 32. How? Why?
The capacity may change to anything, as long as the resulting slice of the conversion contains the runes of the input string. This is the only thing the spec requires and guarantees. The compiler may decide to use a lower capacity if you pass the slice to fmt.Println(), as this signals that the slice may escape. Again, the decision made by the compiler is out of your hands.
Escape means the value may escape from the function, and as such, it must be allocated on the heap (and not on the stack), because the stack may get destroyed / overwritten once the function returns, and if the value "escapes" from the function, its memory area must be retained as long as there is a reference to the value. The Go compiler performs escape analysis, and if it can't prove a value does not escape the function it's declared in, the value will be allocated on the heap.
See related question: Calculating sha256 gives different results after appending slices depending on if I print out the slice before or not
The reason the string and the []rune return different results from len is that it is counting different things: len(string) returns the length in bytes (which may be more than the number of characters, for multi-byte characters), while len([]rune) returns the number of runes (code points), which is generally the number of characters.
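A sketch of the three different "lengths" of the same text; utf8.RuneCountInString counts runes without allocating a slice at all:

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// Shows byte length, rune-slice length, and a direct rune count.
func main() {
    input := "你好"
    fmt.Println(len(input))                    // 6: bytes (3 per CJK character)
    fmt.Println(len([]rune(input)))            // 2: runes
    fmt.Println(utf8.RuneCountInString(input)) // 2: same count, no allocation
}
```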
This blog post goes into detail how exactly Go treats text in various forms: https://blog.golang.org/strings
I am trying to efficiently count runes in a UTF-8 string using the utf8 package. Is this example optimal in that it does not copy the underlying data?
https://golang.org/pkg/unicode/utf8/#example_DecodeRuneInString
func main() {
    str := "Hello, 世界" // let's assume a runtime-provided string
    for len(str) > 0 {
        r, size := utf8.DecodeRuneInString(str)
        fmt.Printf("%c %v\n", r, size)
        str = str[size:] // performs copy?
    }
}
I found StringHeader in the reflect package (used together with unsafe). Is this the exact structure of a string in Go? If so, it is conceivable that slicing a string merely updates Data or allocates a new StringHeader altogether.
type StringHeader struct {
    Data uintptr
    Len  int
}
Bonus: where can I find the code that performs string slicing so that I could look it up myself? Any of these?
https://golang.org/src/runtime/slice.go
https://golang.org/src/runtime/string.go
This related SO answer suggests that runtime strings incur a copy when converted from string to []byte.
Slicing Strings
does a slice of a string perform a copy of the underlying data?
No, it does not. See this post by Russ Cox:
A string is represented in memory as a 2-word structure containing a pointer to the string data and a length. Because the string is immutable, it is safe for multiple strings to share the same storage, so slicing s results in a new 2-word structure with a potentially different pointer and length that still refers to the same byte sequence. This means that slicing can be done without allocation or copying, making string slices as efficient as passing around explicit indexes.
-- Go Data Structures
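This sharing can be observed on Go 1.20+ via unsafe.StringData, which exposes the pointer to a string's backing bytes; sameBacking is a hypothetical helper written for this check:

```go
package main

import (
    "fmt"
    "unsafe"
)

// sameBacking reports whether sub's bytes live inside s's backing array.
// Requires Go 1.20+ for unsafe.StringData.
func sameBacking(s, sub string) bool {
    ps := uintptr(unsafe.Pointer(unsafe.StringData(s)))
    pt := uintptr(unsafe.Pointer(unsafe.StringData(sub)))
    return pt >= ps && pt+uintptr(len(sub)) <= ps+uintptr(len(s))
}

func main() {
    s := "Hello, 世界"
    t := s[7:] // "世界": a new 2-word header, but the same bytes
    fmt.Println(sameBacking(s, t))
}
```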
Slices, Performance, and Iterating Over Runes
A slice is basically three things: a length, a capacity, and a pointer to a location in an underlying array.
As such, slices themselves are not very large: two ints and a pointer (possibly some other small implementation details). So the allocation required to make a copy of a slice is very small and doesn't depend on the size of the underlying array. And no new allocation is required when you simply update the length, capacity, and pointer location, such as on line 2 of:
foo := []int{3, 4, 5, 6}
foo = foo[1:]
Rather, it's when a new underlying array has to be allocated that a performance impact is felt.
Strings in Go are immutable. So to change a string you need to make a new string. However, strings are closely related to byte slices, e.g. you can create a byte slice from a string with
foo := `here's my string`
fooBytes := []byte(foo)
I believe that will allocate a new array of bytes, because:
a string is in effect a read-only slice of bytes
according to the Go Blog (see Strings, bytes, runes and characters in Go). In general you can use a slice to change the contents of an underlying array, so to produce a usable byte slice from a string you would have to make a copy to keep the user from changing what's supposed to be immutable.
You could use performance profiling and benchmarking to gain further insight into the performance of your program.
Once you have your slice of bytes, fooBytes, reslicing it does not allocate a new array, it just allocates a new slice, which is small. This appears to be what slicing a string does as well.
Note that you don't need to use the utf8 package to count runes in a UTF-8 string, though you may proceed that way if you like; Go handles UTF-8 natively. However, if you want to iterate over characters, you can't represent the string as a slice of bytes, because you could have multi-byte characters. Instead you need to represent it as a slice of runes:
foo := `here's my string`
fooRunes := []rune(foo)
This operation of converting a string to a slice of runes is fast in my experience (trivial in benchmarks I've done, though there may be an allocation). Now you can iterate across fooRunes to count runes, no utf8 package required. Alternatively, you can skip the explicit []rune(foo) conversion and do it implicitly by using a for ... range loop on the string, because those are special:
A for range loop, by contrast, decodes one UTF-8-encoded rune on each iteration. Each time around the loop, the index of the loop is the starting position of the current rune, measured in bytes, and the code point is its value.
-- Strings, bytes, runes and characters in Go
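A minimal sketch of that quoted behavior; runeCount is an illustrative helper (the standard utf8.RuneCountInString does the same thing):

```go
package main

import (
    "fmt"
    "unicode/utf8"
)

// runeCount counts runes by letting `for range` decode them one at a time.
func runeCount(s string) int {
    n := 0
    for range s {
        n++
    }
    return n
}

func main() {
    str := "Hello, 世界"
    for i, r := range str {
        // i is the byte offset of the rune, r is the decoded code point.
        fmt.Printf("byte %d: %c (%d bytes)\n", i, r, utf8.RuneLen(r))
    }
    fmt.Println(runeCount(str)) // 9 runes, even though len(str) is 13 bytes
}
```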