Get byte offset after first char of str in Rust

In Rust, I want to get the byte offset immediately after the first character of a str.
Rust Playground
fn main() {
    let s: &str = "⚀⚁";
    // char is 4 bytes, right??? (not always when in a str)
    let offset: usize = 4;
    let s1: &str = &s[offset..];
    eprintln!("s1 {:?}", s1);
}
The program panics, as expected, with:
thread 'main' panicked at 'byte index 4 is not a char boundary; it is inside '⚁' (bytes 3..6) of `⚀⚁`'
How can I find the byte offset of the second char '⚁'?
Bonus if this can be done safely and without std.
Related:
How to get the byte offset between &str
How to find the starting offset of a string slice of another string?

A char is a 32-bit integer (a Unicode scalar value), but individual characters inside a str are encoded as variable-width UTF-8, as small as a single 8-bit byte.
You can iterate through the characters of the str and their boundaries using str::char_indices, and your code would look like this:
fn main() {
    let s: &str = "⚀⚁";
    let (offset, _) = s.char_indices().nth(1).unwrap();
    dbg!(offset); // 3
    let s1: &str = &s[offset..];
    eprintln!("s1 {:?}", s1); // s1 "⚁"
}
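As for the bonus: char_indices, chars, and char::len_utf8 all live in core, not just std, so the same approach works in a no_std crate. A minimal sketch (offset_after_first_char is a made-up helper name): the offset after the first char is simply that char's UTF-8 length.
fn offset_after_first_char(s: &str) -> usize {
    // The byte offset just past the first char is that char's UTF-8
    // length; an empty string yields 0.
    s.chars().next().map(char::len_utf8).unwrap_or(0)
}

fn main() {
    let s = "⚀⚁";
    let offset = offset_after_first_char(s);
    assert_eq!(offset, 3);
    assert_eq!(&s[offset..], "⚁");
}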

Related

How can I convert LPWSTR into &str

In my previous question I asked how to cast LPVOID to LPNETRESOURCEW, and that went well. I have a NETRESOURCEW struct with the fields:
dwScope: DWORD,
dwType: DWORD,
dwDisplayType: DWORD,
dwUsage: DWORD,
lpLocalName: LPWSTR,
lpRemoteName: LPWSTR,
lpComment: LPWSTR,
lpProvider: LPWSTR,
According to the docs, nr.lpRemoteName is an LPWSTR -> *mut u16. I've tried using OsString::from_wide, but it didn't go well. How can I convert an LPWSTR into a Rust &str or String and print it to the console?
This is normally done by marshalling the string pointer into a slice.
First you need to get the length of the string / slice which can be done like this if the string is null-terminated:
let length = (0..).take_while(|&i| *my_string.offset(i) != 0).count();
Then you can create the slice like so
let slice = std::slice::from_raw_parts(my_string, length);
and finally convert the slice into an OsString:
let my_rust_string = OsString::from_wide(slice).to_string_lossy().into_owned();
Update due to comment
I have verified that the given approach works as can be reproduced using this snippet:
use std::{ffi::OsString, os::windows::prelude::OsStringExt};
use windows_sys::w;
fn main() {
    let my_string = w!("This is a UTF-16 string.");
    let slice;
    unsafe {
        let length = (0..).take_while(|&i| *my_string.offset(i) != 0).count();
        slice = std::slice::from_raw_parts(my_string, length);
    }
    let my_rust_string = OsString::from_wide(slice).to_string_lossy().into_owned();
    println!("{}", my_rust_string);
}
If you get a STATUS_ACCESS_VIOLATION it is most likely because your string is not null-terminated in which case you would need to determine the length of the string in another way or preallocate the buffer.
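Wrapped up, the whole marshalling dance might look like the sketch below; lpwstr_to_string is a hypothetical helper name, and it assumes the pointer is valid and null-terminated:
use std::ffi::OsString;
use std::os::windows::prelude::OsStringExt;

/// Converts a null-terminated wide (UTF-16) string to a Rust String.
///
/// Safety: `ptr` must be non-null, properly aligned, and point to a
/// null-terminated UTF-16 buffer that stays alive for the whole call.
unsafe fn lpwstr_to_string(ptr: *const u16) -> String {
    // Walk the buffer until the null terminator to find the length.
    let length = (0..).take_while(|&i| *ptr.offset(i) != 0).count();
    let slice = std::slice::from_raw_parts(ptr, length);
    // The lossy conversion replaces any unpaired surrogates with U+FFFD.
    OsString::from_wide(slice).to_string_lossy().into_owned()
}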

How to create a &str from a single character? [duplicate]

This question already has answers here:
Converting a char to &str
(3 answers)
Closed 1 year ago.
Frankly, I can't believe I'm asking this, but how do I create a &str (or a String) when I have a single character?
The first thing to try for simple conversions is into().
It works for String because String implements From<char>.
let c: char = 'π';
let s: String = c.into();
You can't build a &str directly from a char. A &str is a reference type. The easiest solution is to build it from a string:
let s: &str = &s;
An alternative for most kinds of values is the format macro:
let s = format!("{}", c);
If you just need to use the &str locally and want to avoid a heap allocation, you can use the char method encode_utf8:
fn main() {
    let c = 'n';
    let mut tmp = [0; 1];
    let foo = c.encode_utf8(&mut tmp);
    println!("str: {}", foo);
}
or
fn main() {
    let tmp = [b'n'; 1];
    let foo = std::str::from_utf8(&tmp).unwrap();
    println!("str: {}", foo);
}
To work with every possible char, you need a u8 array of length 4 ([0; 4]). In UTF-8, ASCII chars are represented as a single byte, but all other characters require more, up to a maximum of 4 bytes.
This is a simplified example based on an answer from a very similar question:
Converting a char to &str
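To illustrate the 4-byte buffer with a char outside ASCII, a minimal sketch:
fn main() {
    let c = 'π';
    // Four bytes cover the UTF-8 encoding of any char.
    let mut buf = [0u8; 4];
    let s: &str = c.encode_utf8(&mut buf);
    println!("str: {} ({} bytes)", s, s.len()); // str: π (2 bytes)
}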

Difference between double quotes and single quotes in Rust

I was doing Advent of Code 2020 day 3 in Rust to train a little, since I am new to Rust, and my code would not compile depending on whether I used single quotes or double quotes for my "tree" variable.
The first code snippet would not compile, throwing the error: expected u8, found &[u8; 1]
use std::fs;
fn main() {
    let text: String = fs::read_to_string("./data/text").unwrap();
    let vec: Vec<&str> = text.lines().collect();
    let vec_vertical_len = vec.len();
    let vec_horizontal_len = vec[0].len();
    let mut i_pointer: usize = 0;
    let mut j_pointer: usize = 0;
    let mut tree_counter: usize = 0;
    let tree = b"#";
    loop {
        i_pointer += 3;
        j_pointer += 1;
        if j_pointer >= vec_vertical_len {
            break;
        }
        let i_index = i_pointer % vec_horizontal_len;
        let character = vec[j_pointer].as_bytes()[i_index];
        if character == tree {
            tree_counter += 1
        }
    }
    println!("{}", tree_counter);
}
The second snippet compiles and gives the right answer.
use std::fs;
fn main() {
    let text: String = fs::read_to_string("./data/text").unwrap();
    let vec: Vec<&str> = text.lines().collect();
    let vec_vertical_len = vec.len();
    let vec_horizontal_len = vec[0].len();
    let mut i_pointer: usize = 0;
    let mut j_pointer: usize = 0;
    let mut tree_counter: usize = 0;
    let tree = b'#';
    loop {
        i_pointer += 3;
        j_pointer += 1;
        if j_pointer >= vec_vertical_len {
            break;
        }
        let i_index = i_pointer % vec_horizontal_len;
        let character = vec[j_pointer].as_bytes()[i_index];
        if character == tree {
            tree_counter += 1
        }
    }
    println!("{}", tree_counter);
}
I did not find any reference explaining what is going on with single versus double quotes. Can someone help me?
The short answer is that it works similarly to Java: single quotes for characters, double quotes for strings.
let a: char = 'k';
let b: &'static str = "k";
The b'' or b"" prefix means "take what I have here and interpret it as a byte literal instead".
let a: u8 = b'k';
let b: &'static [u8; 1] = b"k";
The reason strings result in references is how they are stored in the compiled binary. It would be too inefficient to store a copy of a string constant inside each method, so string literals are placed in a read-only data section of the binary. When your program is executed, you take a reference to the bytes in that section (hence the 'static lifetime).
Going further down the rabbit hole, single quotes technically hold a codepoint, which is essentially what you might think of as a character, so a Unicode character is a single codepoint even though it may be multiple bytes long. A codepoint always fits into a u32, so you can safely convert any char using as u32, but not the other way around, since not all u32 values match valid codepoints. This also means b'\u{x}' is not valid, since \u{x} may produce a character that does not fit within a single byte.
// U+1F600 is a unicode smiley face
let a: char = '\u{1F600}';
assert_eq!(a as u32, 0x1F600);
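The reverse direction is fallible and (on recent Rust) goes through char::from_u32, which returns None for values that are not valid scalar values:
// Not every u32 is a valid codepoint, so converting back yields an Option.
assert_eq!(char::from_u32(0x1F600), Some('😀'));
assert_eq!(char::from_u32(0xD800), None); // a surrogate, not a scalar value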
However, you might find it interesting that since Rust strings are stored as UTF-8, codepoints above 127 occupy multiple bytes inside a string, even those (128 to 255) that would fit into a single byte on their own. As you may already know, UTF-8 is simply a way of converting codepoints to bytes and back again.
let foo: &'static str = "\u{1F600}";
let foo_chars: Vec<char> = foo.chars().collect();
let foo_bytes: Vec<u8> = foo.bytes().collect();
assert_eq!(foo_chars.len(), 1);
assert_eq!(foo_bytes.len(), 4);
assert_eq!(foo_chars[0] as u32, 0x1F600);
assert_eq!(foo_bytes, vec![240, 159, 152, 128]);

How to convert a Rust char to an integer so that '1' becomes 1?

I am trying to find the sum of the digits of a given number. For example, 134 will give 8.
My plan is to convert the number into a string using .to_string() and then use .chars() to iterate over the digits as characters. Then I want to convert every char in the iteration into an integer and add it to a variable. I want to get the final value of this variable.
I tried using the code below to convert a char into an integer:
fn main() {
    let x = "123";
    for y in x.chars() {
        let z = y.parse::<i32>().unwrap();
        println!("{}", z + 1);
    }
}
(Playground)
But it results in this error:
error[E0599]: no method named `parse` found for type `char` in the current scope
 --> src/main.rs:4:19
  |
4 |         let z = y.parse::<i32>().unwrap();
  |                   ^^^^^
This code does exactly what I want, but I first have to convert each char into a String and then into an integer before incrementing sum by z.
fn main() {
    let mut sum = 0;
    let x = 123;
    let x = x.to_string();
    for y in x.chars() {
        // converting `y` to string and then to integer
        let z = (y.to_string()).parse::<i32>().unwrap();
        // incrementing `sum` by `z`
        sum += z;
    }
    println!("{}", sum);
}
(Playground)
The method you need is char::to_digit. It converts a char to the number it represents in the given radix.
You can also use Iterator::sum to conveniently calculate the sum of a sequence:
fn main() {
    const RADIX: u32 = 10;
    let x = "134";
    println!("{}", x.chars().map(|c| c.to_digit(RADIX).unwrap()).sum::<u32>());
}
my_char as u32 - '0' as u32
Now, there's a lot more to unpack about this answer.
It works because the ASCII (and thus UTF-8) encodings have the Arabic numerals 0-9 ordered in ascending order. You can get the scalar values and subtract them.
However, what should it do for values outside this range? What happens if you provide 'p'? It returns 64. What about '.'? This will panic. And '♥' will return 9781.
Strings are not just bags of bytes. They are UTF-8 encoded and you cannot just ignore that fact. Every char can hold any Unicode scalar value.
That's why strings are the wrong abstraction for the problem.
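If you still want the plain subtraction but without the silent garbage, here is a small sketch of a checked variant (digit_value is a made-up name) that only accepts ASCII digits:
fn digit_value(c: char) -> Option<u32> {
    // Only '0'..='9' qualify; 'p', '.' and '♥' all yield None.
    if c.is_ascii_digit() {
        Some(c as u32 - '0' as u32)
    } else {
        None
    }
}

fn main() {
    assert_eq!(digit_value('1'), Some(1));
    assert_eq!(digit_value('♥'), None);
}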
From an efficiency perspective, allocating a string seems inefficient. Rosetta Code has an example of using an iterator which only does numeric operations:
struct DigitIter(usize, usize);

impl Iterator for DigitIter {
    type Item = usize;
    fn next(&mut self) -> Option<Self::Item> {
        if self.0 == 0 {
            None
        } else {
            let ret = self.0 % self.1;
            self.0 /= self.1;
            Some(ret)
        }
    }
}

fn main() {
    println!("{}", DigitIter(1234, 10).sum::<usize>());
}
If c is your character, you can just write:
c as i32 - 0x30;
Test with:
let c: char = '2';
let n: i32 = c as i32 - 0x30;
println!("{}", n);
output:
2
NB: 0x30 is '0' in the ASCII table, easy enough to remember!
Another way is to iterate over the characters of your string, converting and adding them using fold.
fn sum_of_string(s: &str) -> u32 {
    s.chars().fold(0, |acc, c| c.to_digit(10).unwrap_or(0) + acc)
}

fn main() {
    let x = "123";
    println!("{}", sum_of_string(x));
}

Creating a string from Vec<char> [duplicate]

This question already has answers here:
How to convert Vec<char> to a string
(2 answers)
Closed 6 years ago.
I've got a Vec<char> that I need to turn into a &str or String, but I'm unsure of the best way to do this. I've looked around, and every resource I've found seems to be outdated in some way. The answers in this question don't seem to be applicable to the newest build.
I'm using the nightly from 2015-03-19.
The iterator-based approach with .collect should work, after updating for language changes:
char_vector.iter().cloned().collect::<String>();
(I've chosen to replace .map(|c| *c) with .cloned() but either works.)
If your vector can be consumed, you can also use into_iter to avoid the clone:
fn main() {
    let char_vector = vec!['h', 'e', 'l', 'l', 'o'];
    let str: String = char_vector.into_iter().collect();
    println!("{}", str);
}
You can convert the Vec into a String without doing any allocations. It requires quite a bit of unsafe code, though:
#![feature(raw, unicode)]
use std::raw::Repr;
use std::slice::from_raw_parts_mut;
fn inplace_to_string(v: Vec<char>) -> String {
    unsafe {
        let mut i = 0;
        {
            let ch_v = &v[..];
            let r = ch_v.repr();
            let p: &mut [u8] = from_raw_parts_mut(r.data as *mut u8, r.len * 4);
            for ch in ch_v {
                i += ch.encode_utf8(&mut p[i..i + 4]).unwrap();
            }
        }
        let p = v.as_ptr();
        let cap = v.capacity() * 4;
        std::mem::forget(v);
        let v = Vec::from_raw_parts(p as *mut u8, i, cap);
        String::from_utf8_unchecked(v)
    }
}

fn main() {
    let char_vector = vec!['h', 'ä', 'l', 'l', 'ö'];
    let str: String = char_vector.iter().cloned().collect();
    let str2 = inplace_to_string(char_vector);
    println!("{}", str);
    println!("{}", str2);
}
PlayPen
Detailed Explanation
This creates a mutable u8 slice and a char slice that simultaneously point to the same buffer (breaking all of Rust's aliasing guarantees). Note that the u8 slice is four times as large as the char slice, since a char always takes up 4 bytes.
let ch_v = &v[..];
let r = ch_v.repr();
let p: &mut [u8] = from_raw_parts_mut(r.data as *mut u8, r.len * 4);
We need that to iterate over the chars and replace them with their UTF-8 encoded counterparts. Since a char's UTF-8 encoding is never longer than the 4 bytes the char itself occupies, we can guarantee that we never overwrite any part we haven't read yet.
for ch in ch_v {
    i += ch.encode_utf8(&mut p[i..i + 4]).unwrap();
}
Since a char is always a valid Unicode scalar value and our buffer is always exactly 4 bytes (the maximum a UTF-8 encoded char will ever need), we can encode our chars to UTF-8 without checking whether it worked (it always will). The encode_utf8 function returns the length of the UTF-8 representation, so our index i always ends up just past the last byte written.
Finally we need to do some cleanup. Our vector is still of type Vec<char>. We get all the info we need (the pointer to the heap-allocated array and the capacity):
let p = v.as_ptr();
let cap = v.capacity() * 4;
Then we release the previous vector from all of its obligations, such as freeing the memory:
std::mem::forget(v);
and finally recreate the u8 vector with the correct length and capacity and directly turn it into a String. The conversion to String does not need to be checked, as we already know the UTF-8 is valid, since the original Vec<char> could only contain valid chars:
let v = Vec::from_raw_parts(p as *mut u8, i, cap);
String::from_utf8_unchecked(v)
