Indexing a String - rust

I want to perform a very simple task, but I cannot manage to stop the compiler from complaining.
fn transform(s: String) -> String {
    let bytes = s.as_bytes();
    format!("{}/{}", bytes[0..2], bytes[2..4])
}
[u8] does not have a constant size known at compile-time.
Any tips on making this operation work as intended?

Indeed, the size of a [u8] isn't known at compile time. The size of a &[u8], however, is known at compile time because it's just a pointer plus a usize representing the length of the sequence.
format!("{:?}/{:?}", &bytes[0..2], &bytes[2..4])
Note that {:?} prints the numeric byte values (for example [97, 98]), not the characters. Rust strings are encoded in UTF-8, so working with strings in this way is generally a bad idea because a single Unicode character may consist of multiple bytes.
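If the goal is text rather than byte values, one option is to slice the &str itself with str::get, which returns None instead of panicking when a range is out of bounds or cuts through a multi-byte character. A minimal sketch, assuming the intent was to join the first two bytes' worth of text and the next two with a slash (the Option return type is a choice made for this example, not something from the question):

```rust
/// Joins bytes 0..2 and 2..4 of `s` with a slash.
/// Returns None if `s` is shorter than 4 bytes or if either range
/// does not fall on UTF-8 character boundaries.
fn transform(s: &str) -> Option<String> {
    let head = s.get(0..2)?;
    let tail = s.get(2..4)?;
    Some(format!("{}/{}", head, tail))
}
```

Calling transform("abcd") yields Some("ab/cd"), while an input such as "aébc" yields None because byte 2 falls inside the two-byte é.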

Related

Why can fixed-size arrays be on the stack, but str cannot?

Answers to What are the differences between Rust's `String` and `str`? describe how &str and String relate to each other.
What is surprising is that a str is more limited than a fixed-sized array, because it cannot be declared as a local variable. Compiling
let arr_owned = [0u8; 32];
let arr_slice = &arr_owned;
let str_slice = "apple";
let str_owned = *str_slice;
in Rust 1.32.0, I get
error[E0277]: the size for values of type `str` cannot be known at compilation time
--> src/lib.rs:6:9
which is confusing, because the size of "apple" can be known by the compiler, it is just not part of the str type.
Is there a linguistic reason for the asymmetry between Vec<T> <-> [T; N] and String <-> str owned types? Could an str[N] type, which would be a shorthand to a [u8; N] that only contains provably valid UTF-8 encoded strings, replace str without breaking lots of existing code?
asymmetry between Vec<T> <-> [T; N] and String <-> str
That's because you confused something here. The relationships are rather like this:
Vec<T> ⇔ [T]
String ⇔ str
In all four of these types, the length information is stored at runtime, not compile time. Fixed-size arrays ([T; N]) are different in that regard: they store the length at compile time, but not at runtime!
And indeed, both [T] and str can't be stored on the stack, because they are both unsized.
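The difference between "length in the type" and "length at runtime" can be observed with std::mem::size_of and std::mem::size_of_val. This is only an illustrative sketch; the function names sizes and runtime_len are made up for the example:

```rust
use std::mem::{size_of, size_of_val};

/// Returns the sizes in bytes of a fixed-size array, a slice reference
/// and a str reference.
fn sizes() -> (usize, usize, usize) {
    (
        size_of::<[u8; 32]>(), // the length is part of the type: 32 bytes
        size_of::<&[u8]>(),    // pointer plus length, each usize-sized
        size_of::<&str>(),     // same pointer-plus-length layout
    )
}

/// The byte size of the str behind a reference, read at runtime
/// from the length stored in the reference.
fn runtime_len(s: &str) -> usize {
    size_of_val(s)
}
```

Here runtime_len("apple") is 5: the compiler does know the literal's length, but that length lives in the &str value rather than in the str type.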
Could an str[N] type, which would be a shorthand to a [u8; N] that only contains provably valid UTF-8 encoded strings, replace str without breaking lots of existing code?
It wouldn't replace str, but it could be an interesting addition indeed! But there are probably reasons why it doesn't exist yet, e.g. because the length of a Unicode string is usually not really relevant. In particular, it usually doesn't make sense to "take a Unicode string with exactly three bytes".
[T] and str can't be stored on the stack, because they are both unsized
While this is true today, it may not be true in the future. RFC 1909 introduces unsized rvalues. One of the powers that this feature would give is variable-length arrays:
The RFC also describes an extension to the array literal syntax: [e; dyn n]. In the syntax, n isn't necessarily a constant expression. The array is dynamically allocated on the stack
No mention is made of whether a string will be directly possible, but one could always create a stack-allocated array of bytes to be used as storage for a string.
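On stable Rust today, the "array of bytes as storage" idea can be sketched with a fixed-size buffer and std::str::from_utf8. The function name copy_into and the 8-byte capacity are assumptions chosen for the example:

```rust
/// Copies `text` into a fixed 8-byte buffer (which can live on the stack)
/// and returns a &str view of the copied bytes.
/// Returns None if `text` does not fit in the buffer.
fn copy_into<'a>(buf: &'a mut [u8; 8], text: &str) -> Option<&'a str> {
    let bytes = text.as_bytes();
    if bytes.len() > buf.len() {
        return None;
    }
    buf[..bytes.len()].copy_from_slice(bytes);
    // The copied bytes came from a &str, so this validation succeeds.
    std::str::from_utf8(&buf[..bytes.len()]).ok()
}
```

The caller declares the buffer as a local, for example let mut buf = [0u8; 8];, so no heap allocation is involved.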

Does Rust provide a way to parse integer numbers directly from ASCII data in byte (u8) arrays?

Rust has FromStr, however as far as I can see this only takes Unicode text input. Is there an equivalent to this for [u8] arrays?
By "parse" I mean take ASCII characters and return an integer, like C's atoi does.
Or do I need to either...
Convert the u8 array to a string first, then call FromStr.
Call out to libc's atoi.
Write an atoi in Rust.
In nearly all cases the first option is reasonable; however, there are cases where files may be very large, with no predefined encoding, or contain mixed binary and text, where it's most straightforward to read integer numbers as bytes.
No, the standard library has no such feature, but it doesn't need one.
As stated in the comments, the raw bytes can be converted to a &str via:
str::from_utf8
str::from_utf8_unchecked
Neither of these performs extra allocation. The first ensures the bytes are valid UTF-8; the second does not. Everyone should use the checked form until profiling proves that it's a bottleneck, and then switch to the unchecked form only once it's proven safe to do so.
If bytes deeper in the data need to be parsed, a slice of the raw bytes can be obtained before conversion:
use std::str;
fn main() {
    let raw_data = b"123132";
    let the_bytes = &raw_data[1..4];
    let the_string = str::from_utf8(the_bytes).expect("not UTF-8");
    let the_number: u64 = the_string.parse().expect("not a number");
    assert_eq!(the_number, 231);
}
As in other code, these lines can be extracted into a function or a trait to allow for reuse. However, once that path is followed, it would be a good idea to look into one of the many great crates aimed at parsing. This is especially true if there's a need to parse binary data in addition to textual data.
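One possible shape for such a reusable function is sketched below. The name parse_bytes and the choice to collapse both failure modes into None are assumptions made for the example:

```rust
use std::str::{self, FromStr};

/// Parses any FromStr type from raw bytes without allocating.
/// Returns None if the bytes are not valid UTF-8 or do not parse as T.
fn parse_bytes<T: FromStr>(bytes: &[u8]) -> Option<T> {
    let text = str::from_utf8(bytes).ok()?;
    text.parse().ok()
}
```

For instance, parse_bytes::<u64>(&b"123132"[1..4]) gives Some(231). A caller that needs to tell "not UTF-8" apart from "not a number" would return a Result with a dedicated error type instead.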
I do not know of any way in the standard library, but maybe the atoi crate works for you? Full disclosure: I am its author.
use atoi::atoi;
let (number, digits) = atoi::<u32>(b"42 is the answer"); //returns (42,2)
You can check if the second element of the tuple is a zero to see if the slice starts with a digit.
let (number, digits) = atoi::<u32>(b"x"); //returns (0,0)
let (number, digits) = atoi::<u32>(b"0"); //returns (0,1)
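For the "write an atoi in Rust" option from the question, a hand-rolled version with the same (number, digits) return shape might look like the following. This is a sketch for illustration only, not the atoi crate's implementation; the name atoi_prefix and the stop-on-overflow behaviour are assumptions:

```rust
/// Parses leading ASCII digits from `bytes` into a u32.
/// Returns the number and how many digits were consumed.
/// Parsing stops at the first non-digit, or just before a digit
/// that would overflow u32.
fn atoi_prefix(bytes: &[u8]) -> (u32, usize) {
    let mut number: u32 = 0;
    let mut digits = 0;
    for &b in bytes {
        if !b.is_ascii_digit() {
            break;
        }
        let digit = u32::from(b - b'0');
        match number.checked_mul(10).and_then(|n| n.checked_add(digit)) {
            Some(next) => number = next,
            None => break, // would overflow: stop and keep what we have
        }
        digits += 1;
    }
    (number, digits)
}
```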

Convert a Vec<u16> or Vec<WCHAR> to a &str

I'm getting into Rust programming to realize a small program and I'm a little bit lost in string conversions.
In my program, I have a vector as follows:
let mut name: Vec<winnt::WCHAR> = Vec::new();
WCHAR is the same as a u16 on my Windows machine.
I hand over the Vec<u16> to a C function (as a pointer) which fills it with data. I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot manage to get this conversion working.
The only thing I managed to get working is to convert it to a WideString:
widestr = unsafe { WideCString::from_ptr_str(name.as_ptr()) };
But this seems to be a step in the wrong direction.
What is the best way to convert the Vec<u16> to a &str, under the assumption that the vector holds a valid and null-terminated string?
I then need to convert the string contained in the vector into a &str. However, no matter what I try, I cannot manage to get this conversion working.
There's no way of making this a "free" conversion.
A &str is a Unicode string encoded with UTF-8. This is a byte-oriented encoding. If you have UTF-16 (or the different but common UCS-2 encoding), there's no way to read one as the other. That's equivalent to trying to read a JPEG image as a PDF. Both chunks of data might be a string, but the encoding is important.
The first question is "do you really need to do that?". Many times, you can take data from one function and shovel it back into another function, never looking at it. If you can get away with that, that might be the best answer.
If you do need to transform it, then you have to deal with the errors that can occur. An arbitrary array of 16-bit integers may not be valid UTF-16 or UCS-2. These encodings have edge cases that can easily produce invalid strings. Null-termination is another aspect - Unicode actually allows for embedded NUL characters, so a null-terminated string can't hold all possible Unicode characters!
Once you've ensured that the encoding is valid [1] and figured out how many entries in the input vector comprise the string, then you have to decode the input format and re-encode to the output format. This is likely to require some kind of new allocation, so you are most likely to end up with a String, which can then be used most anywhere a &str can be used.
There is a built-in method to convert UTF-16 data to a String: String::from_utf16. Note that it returns a Result to allow for these error cases. There's also String::from_utf16_lossy, which replaces invalid encoded parts with the Unicode replacement character.
let name = [0x68, 0x65, 0x6c, 0x6c, 0x6f];
let a = String::from_utf16(&name);
let b = String::from_utf16_lossy(&name);
println!("{:?}", a);
println!("{:?}", b);
If you are starting from a pointer to a u16 or WCHAR, you will need to convert to a slice first by using slice::from_raw_parts. If you have a null-terminated string, you need to find the NUL yourself and slice the input appropriately.
[1]: This is actually a great way of using types; a &str is guaranteed to be UTF-8 encoded, so no further check needs to be made. Similarly, the WideCString is likely to perform a check once upon construction and then can skip the check on later uses.
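Finding the NUL and decoding can be combined in one small helper. A minimal sketch, assuming the buffer has already been filled by the C function; the name from_wide_nul is made up, and treating a missing terminator as "use the whole buffer" is a design choice for this example:

```rust
use std::string::FromUtf16Error;

/// Decodes UTF-16 code units up to (not including) the first NUL.
/// If there is no NUL, the whole buffer is decoded.
/// Fails if the selected units are not valid UTF-16.
fn from_wide_nul(buf: &[u16]) -> Result<String, FromUtf16Error> {
    let len = buf
        .iter()
        .position(|&unit| unit == 0)
        .unwrap_or(buf.len());
    String::from_utf16(&buf[..len])
}
```

The result is an owned String; borrowing it with &name or name.as_str() gives the &str the question asks for.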
This is my simple hack for this case. It probably has bugs (for one, it keeps only the low byte of each code unit, so anything above U+00FF comes out wrong); fix it for your own case:
let mut v = vec![0u16; MAX_PATH as usize];
// imaginary win32 function
win32_function(v.as_mut_ptr());
let mut path = String::new();
for val in v.iter() {
    // keep only the low 8 bits of the UTF-16 code unit
    let c: u8 = (*val & 0xFF) as u8;
    if c == 0 {
        break;
    } else {
        path.push(c as char);
    }
}

Efficiently extract prefix substrings

Currently I'm using the following function to extract prefix substrings:
fn prefix(s: &String, k: usize) -> String {
    s.chars().take(k).collect::<String>()
}
This can then be used for comparisons like so:
let my_string = "ACGT".to_string();
let same = prefix(&my_string, 3) == prefix(&my_string, 2);
However, this allocates a new String for each call to prefix, in addition to the processing for the iteration. Most other languages I'm familiar with have an efficient way to do a comparison like this, using just a view of the strings. Is there a way in Rust?
Yes, you can take subslices of strings using the Index operation:
fn prefix(s: &str, k: usize) -> &str {
    &s[..k]
}
fn main() {
    let my_string = "ACGT".to_string();
    let same = prefix(&my_string, 3) == prefix(&my_string, 2);
    println!("{}", same);
}
Note that slicing a string uses bytes as the unit, not characters. It is up to the programmer to ensure that the slice bounds lie on valid UTF-8 character boundaries. Additionally, you have to ensure that you don't try to slice past the end of the string. Violating either of these will result in a panic.
A slightly more defensive version would be
fn prefix(s: &str, k: usize) -> &str {
    let idx = s.char_indices().nth(k).map(|(idx, _)| idx).unwrap_or(s.len());
    &s[0..idx]
}
The key difference is that we use the char_indices iterator, which tells us the byte offsets corresponding to a character. Indexing into a UTF-8 string is an O(n) operation, and Rust doesn't want to hide that algorithmic complexity from you. This still isn't even complete, because there can be combining characters, for example. Dealing with strings is hard, thanks to the complexity of human language.
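To see what char_indices reports, the following sketch collects the (byte offset, char) pairs for a string; the helper name char_offsets is made up for the example:

```rust
/// Returns each character of `s` together with the byte offset
/// at which it starts.
fn char_offsets(s: &str) -> Vec<(usize, char)> {
    s.char_indices().collect()
}
```

For "aéz" the offsets are 0, 1 and 3, because é occupies two bytes in UTF-8. Slicing at offset 2 would therefore panic, while slicing at 1 or 3 is fine.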
Most other languages I'm familiar with have an efficient way
Doubtful :-) To be efficient in time, they'd have to know how many bytes to skip ahead for every character. Either they'd have to keep a lookup table for every string or use a fixed-size character encoding. Both of those solutions can use more memory than needed, and a fixed size encoding doesn't even work when you have combining characters, for example.
Of course, other languages could just say "LOL, strings are just arrays of bytes, good luck with treating them correctly", and efficiently ignore your character encoding...
Two additional notes
Your predicate doesn't really make sense. A string of 2 letters will never match one of 3 letters. For strings to match, they must have the same number of bytes.
You should never need to take &String as a function argument. Taking a &str is a more accepting argument in all cases except for one teeny tiny little case that no one needs — knowing the capacity of a String, but without being able to modify the string.
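Both notes can be folded into one helper that compares the first k characters of two strings without allocating and accepts &str. This is a sketch under the assumption that "prefix" means the first k chars; the name shares_prefix is made up:

```rust
/// Returns true if the first `k` characters of `a` and `b` are equal.
/// If either string has fewer than `k` characters, the shorter
/// sequences are compared as they are.
fn shares_prefix(a: &str, b: &str, k: usize) -> bool {
    a.chars().take(k).eq(b.chars().take(k))
}
```

Because the parameters are &str, callers can pass a string literal directly or a borrowed String such as &my_string.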
While Shepmaster's answer is absolutely correct for the general case of string slicing, I'd like to add that sometimes there are easier ways.
If you know in advance the set of characters you're working with (the "ACGT" example suggests you're working with nucleobases, so it is possible that these are all the characters you need) then you can use slices of bytes &[u8] instead of string slices &str. You can always get a byte slice out of a string slice and a Vec<u8> out of a String, if necessary:
let s: String = "ATGC".into();
let ss: &str = &s;
let bs: &[u8] = ss.as_bytes(); // borrowed view of the same bytes
let b: Vec<u8> = s.clone().into_bytes(); // owned bytes; cloning keeps s, ss and bs usable
Also, there are byte slice and byte character literals, just prefix regular string/char literals with b:
let sl: &[u8] = b"ATGC";
let bl: u8 = b'G';
Working with byte slices gives you constant-time indexing (and thus slicing) operations, so checking for prefix equality is easy (like Shepmaster's first variant, but without the possibility of panics from splitting a character; it still panics if k is too large):
fn prefix(s: &[u8], k: usize) -> &[u8] {
    &s[..k]
}
If you need, you can turn byte slices/vectors back to strings. This operation, of course, checks validity of UTF-8 encoding so it may fail, but if you only work with ASCII, you can safely ignore these errors and just unwrap():
let ss2: &str = std::str::from_utf8(bs).unwrap();
let s2: String = String::from_utf8(b).unwrap();
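If a too-large k should not panic either, the slice method get can be used in place of indexing. A small sketch; the name checked_prefix and the Option return are choices made for this example:

```rust
/// Returns the first `k` bytes of `s`, or None if `s` is shorter than `k`.
fn checked_prefix(s: &[u8], k: usize) -> Option<&[u8]> {
    s.get(..k)
}
```

With this, checked_prefix(b"ATGC", 2) == checked_prefix(b"ATGA", 2) compares two views without allocating, and an out-of-range k simply produces None.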

How to shuffle a str in place

I want to shuffle a String in place in Rust, but I seem to miss something. The fix is probably trivial...
use std::rand::{Rng, thread_rng};
fn main() {
    // I want to shuffle this string...
    let mut value: String = "SomeValue".to_string();
    let mut bytes = value.as_bytes();
    let mut slice: &mut [u8] = bytes.as_mut_slice();
    thread_rng().shuffle(slice);
    println!("{}", value);
}
The error I get is
<anon>:8:36: 8:41 error: cannot borrow immutable dereference of `&`-pointer `*bytes` as mutable
<anon>:8 let mut slice: &mut [u8] = bytes.as_mut_slice();
^~~~~
I read about String::as_mut_vec() but it's unsafe so I'd rather not use it.
There's no very good way to do this, partly due to the nature of the UTF-8 encoding of strings, and partly due to the inherent properties of Unicode and text.
There are at least three layers of things that could be shuffled in a UTF-8 string:
the raw bytes
the encoded codepoints
the graphemes
Shuffling raw bytes is likely to give an invalid UTF-8 string as output unless the string is entirely ASCII. Non-ASCII characters are encoded as special sequences of multiple bytes, and shuffling these will almost certainly not get them in the right order at the end. Hence shuffling bytes is often not good.
Shuffling codepoints (char in Rust) makes a little bit more sense, but there is still the concept of "special sequences", where so-called combining characters can be layered on to a single letter adding diacritics etc (e.g. letters like ä can be written as a plus U+0308, the codepoint representing the diaeresis). Hence shuffling characters won't give an invalid UTF-8 string, but it may break up these codepoint sequences and give nonsense output.
This brings me to graphemes: the sequences of codepoints that make up a single visible character (like ä is still a single grapheme when written as one or as two codepoints). This will give the most reliably sensible answer.
Then, once you've decided which of these you want to shuffle, you can choose a shuffling strategy:
if the string is guaranteed to be purely ASCII, shuffling the bytes with .shuffle is sensible (with the ASCII assumption, this is equivalent to the others)
otherwise, there's no standard way to operate in place; one would get the elements as an iterator (.chars() for codepoints or .graphemes(true) for graphemes), place them into a vector with .collect::<Vec<_>>(), shuffle the vector, and then collect everything back into a new String with e.g. .iter().map(|x| *x).collect::<String>().
The difficulty of handling codepoints and graphemes arises because UTF-8 does not encode them at a fixed width, so there's no way to take a random codepoint/grapheme out and insert it somewhere else, or otherwise swap two elements efficiently, without just decoding everything into an external Vec.
Not being in-place is unfortunate, but strings are hard.
(If your strings are guaranteed to be ASCII, then using a type like the Ascii provided by ascii would be a good way to keep things straight, at the type-level.)
As an example of the difference of the three things, take a look at:
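use unicode_segmentation::UnicodeSegmentation; // external crate; graphemes() is not in std on current Rust
fn main() {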
let s = "U͍̤͕̜̲̼̜n̹͉̭͜ͅi̷̪c̠͍̖̻o̸̯̖de̮̻͍̤";
println!("bytes: {}", s.bytes().count());
println!("chars: {}", s.chars().count());
println!("graphemes: {}", s.graphemes(true).count());
}
It prints:
bytes: 57
chars: 32
graphemes: 7
(Generate your own; it demonstrates putting multiple combining characters onto a single letter.)
Putting together the suggestion above:
use std::rand::{Rng, thread_rng};
fn str_shuffled(s: &str) -> String {
    let mut graphemes = s.graphemes(true).collect::<Vec<&str>>();
    let mut gslice = graphemes.as_mut_slice();
    let mut rng = thread_rng();
    rng.shuffle(gslice);
    gslice.iter().map(|x| *x).collect::<String>()
}
fn main() {
    println!("{}", str_shuffled("Hello, World!"));
    println!("{}", str_shuffled("selam dünya"));
    println!("{}", str_shuffled("你好世界"));
    println!("{}", str_shuffled("γειά σου κόσμος"));
    println!("{}", str_shuffled("Здравствуйте мир"));
}
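The same collect, shuffle, collect pattern can be sketched using only the standard library of current Rust. Two assumptions are swapped in here and should not be mistaken for the answer's method: the code shuffles codepoints (char) rather than graphemes, and it uses a tiny xorshift generator with a caller-supplied seed as a stand-in for a real random number generator. The name shuffle_chars is made up:

```rust
/// Returns a new String with the chars of `s` in shuffled order.
/// Uses a Fisher-Yates shuffle driven by a small xorshift generator;
/// the generator is a placeholder, not a good source of randomness,
/// and the modulo step introduces a slight bias.
fn shuffle_chars(s: &str, seed: u64) -> String {
    let mut chars: Vec<char> = s.chars().collect();
    let mut state = seed | 1; // xorshift must not start at zero
    for i in (1..chars.len()).rev() {
        // advance the xorshift64 state
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        // pick an index in 0..=i and swap it into position i
        let j = (state % (i as u64 + 1)) as usize;
        chars.swap(i, j);
    }
    chars.into_iter().collect()
}
```

The output is always valid UTF-8 because whole chars are moved, but combining sequences may be split apart, as discussed above.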
I am also a beginner with Rust, but what about:
fn main() {
    // I want to shuffle this string...
    let value = "SomeValue".to_string();
    let mut bytes = value.into_bytes();
    bytes[0] = bytes[1]; // Shuffle takes place.. sorry but std::rand::thread_rng is not available in the Rust installed on my current machine.
    match String::from_utf8(bytes) { // Should not copy the contents according to documentation.
        Ok(s) => println!("{}", s),
        _ => println!("Error occurred!")
    }
}
Also keep in mind that Rust's default string encoding is UTF-8 when fiddling around with sequences of bytes. ;)
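The into_bytes / from_utf8 round trip can be wrapped in a helper that refuses non-ASCII input up front, so that reordering bytes cannot break the encoding. This is a sketch; the name permute_ascii and the closure parameter are assumptions made for the example, and any in-place operation on the bytes (a shuffle, a reverse, a sort) can be passed in:

```rust
/// Applies `permute` to the bytes of `s` in place and returns the result.
/// Returns None if `s` contains non-ASCII characters, or if `permute`
/// leaves bytes behind that are not valid UTF-8.
fn permute_ascii<F>(s: String, permute: F) -> Option<String>
where
    F: FnOnce(&mut [u8]),
{
    if !s.is_ascii() {
        return None;
    }
    let mut bytes = s.into_bytes(); // takes over the buffer, no copy
    permute(&mut bytes[..]);
    String::from_utf8(bytes).ok() // validates and reuses the same buffer
}
```

For example, permute_ascii("SomeValue".to_string(), |b| b.reverse()) gives Some("eulaVemoS").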
This was a great suggestion and led me to the following solution, thanks!
use std::rand::{Rng, thread_rng};
fn main() {
    // I want to shuffle this string...
    let value: String = "SomeValue".to_string();
    let mut bytes = value.into_bytes();
    thread_rng().shuffle(&mut *bytes.as_mut_slice());
    match String::from_utf8(bytes) { // Should not copy the contents according to documentation.
        Ok(s) => println!("{}", s),
        _ => println!("Error occurred!")
    }
}
rustc 0.13.0-nightly (ad9e75938 2015-01-05 00:26:28 +0000)

Resources