Proper way to access Vec<&[u8]> as strings [duplicate]

This question already has answers here:
How do I convert a Vector of bytes (u8) to a string?
(5 answers)
Closed 3 years ago.
I have a Vec<&[u8]> that I want to convert to a String like this:
let rfrce: Vec<&[u8]> = rec.alleles();
for r in rfrce {
// create new String from rfrce
}
I tried this, but it does not compile: a u8 can be cast to char, but r is a &[u8], and a [u8] cannot:
let rfrce = rec.alleles();
let mut str = String::from("");
for r in rfrce {
str.push(*r as char);
}

Because r is a slice of u8 (&[u8]), you need to convert it to a valid &str and use the push_str method of String.
use std::str;
fn main() {
let rfrce = vec![&[65,66,67], &[68,69,70]];
let mut str = String::new();
for r in rfrce {
str.push_str(str::from_utf8(r).unwrap());
}
println!("{}", str);
}
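The unwrap() above panics on the first slice that is not valid UTF-8. A minimal sketch of two alternatives, assuming hypothetical helper names join_utf8 and join_utf8_lossy (neither comes from the question): one propagates the error, the other substitutes U+FFFD for invalid bytes.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper: decode every slice as UTF-8 and concatenate the results.
/// Returns the first decoding error instead of panicking.
fn join_utf8(parts: &[&[u8]]) -> Result<String, Utf8Error> {
    let mut out = String::new();
    for part in parts {
        out.push_str(str::from_utf8(part)?);
    }
    Ok(out)
}

/// Hypothetical helper: same idea, but invalid byte sequences are replaced
/// with U+FFFD (the replacement character) instead of producing an error.
fn join_utf8_lossy(parts: &[&[u8]]) -> String {
    parts.iter().map(|part| String::from_utf8_lossy(part)).collect()
}
```

Which one fits depends on whether invalid input should abort the conversion or merely be marked in the output.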

I'd go with TryFrom<u32>:
fn to_string(v: &[&[u8]]) -> Result<String, std::char::CharTryFromError> {
/// Interpret up to four bytes as a big-endian u32 code point (note: this is not UTF-8 decoding)
fn su8_to_u32(s: &[u8]) -> Option<u32> {
if s.len() > 4 {
None
} else {
let shift = (0..=32).step_by(8);
let result = s.iter().rev().cloned().zip(shift).map(|(u, shift)| (u as u32) << shift).sum();
Some(result)
}
}
use std::convert::TryFrom;
v.iter().map(|&s| su8_to_u32(s)).try_fold(String::new(), |mut s, u| {
let u = u.unwrap(); //TODO error handling
s.push(char::try_from(u)?);
Ok(s)
})
}
fn main() {
let rfrce: Vec<&[u8]> = vec![&[48][..], &[49][..], &[50][..], &[51][..]];
assert_eq!(to_string(&rfrce), Ok("0123".into()));
let rfrce: Vec<&[u8]> = vec![&[0xc3, 0xa9][..]]; // https://www.utf8icons.com/character/50089/utf-8-character
assert_eq!(to_string(&rfrce), Ok("쎩".into()));
}
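Note that this approach reads the bytes as a big-endian integer and treats that integer as a code point, which is different from decoding UTF-8. A small sketch of the difference, using two hypothetical helpers (the names are mine, not the answer's):

```rust
/// Hypothetical helper: read two bytes as a big-endian integer and
/// interpret that integer as a Unicode code point.
fn bytes_as_code_point(bytes: [u8; 2]) -> Option<char> {
    char::from_u32(u32::from(u16::from_be_bytes(bytes)))
}

/// Hypothetical helper: decode the bytes as UTF-8 text.
fn bytes_as_utf8(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}
```

For the bytes 0xc3 0xa9, the first helper yields U+C3A9 (the character used in the assertion above), while the second yields "é", because 0xc3 0xa9 is the UTF-8 encoding of U+00E9.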

Related

How can I convert from Vec<char> to u32 in Rust without going through String?

My Rust code runs in an environment where I have no access to std::string or the rest of std::* (but I do have access to core::str). How can I convert a Vec<char> to u32 without going through String, such as:
let num_in_chars: Vec<char> = vec!['1', '2'];
// some process here
// let num = ...
// This is how I could do it if I have access to `String`
// let num = num_in_chars.iter().collect::<String>().parse::<u32>().unwrap();
assert_eq!(12, num);
Thanks

Convert each char to a digit (in the map), then fold over the digits: multiply the running result by 10 and add the new digit:
/// Returns `None` in case of invalid digit.
pub fn vec_to_int(digits: impl IntoIterator<Item = char>) -> Option<u32> {
const RADIX: u32 = 10;
digits
.into_iter()
.map(|c| c.to_digit(RADIX))
.try_fold(0, |ans, i| i.map(|i| ans * RADIX + i))
}
#[test]
fn it_works() {
let nums = vec!['1', '2'];
let num = vec_to_int(nums);
assert_eq!(Some(12), num);
}
#[test]
fn invalid_digit() {
let nums = vec!['1', 'a'];
let num = vec_to_int(nums);
assert_eq!(None, num);
}
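One caveat: ans * RADIX + i can overflow for long inputs (a panic in debug builds, wrap-around in release builds). A sketch of a variant that uses checked arithmetic, assuming a hypothetical name chars_to_u32_checked; it only uses methods that are also available in core:

```rust
/// Hypothetical variant: returns `None` on an invalid digit or on u32 overflow.
pub fn chars_to_u32_checked(digits: impl IntoIterator<Item = char>) -> Option<u32> {
    const RADIX: u32 = 10;
    digits.into_iter().try_fold(0u32, |acc, c| {
        // `?` stops the fold with `None` as soon as a char is not a digit.
        let digit = c.to_digit(RADIX)?;
        // checked_mul / checked_add return `None` instead of overflowing.
        acc.checked_mul(RADIX)?.checked_add(digit)
    })
}
```

As in the version above, an empty input yields Some(0); whether that should be None instead is a design choice.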

Quick function to convert a String's first letter to uppercase? [duplicate]

This question already has answers here:
Why is capitalizing the first letter of a string so convoluted in Rust?
(9 answers)
Closed 4 years ago.
Does anyone know a function that changes the first letter of a String to the uppercase equivalent?
Ideally, it would be used like so:
let newfoo = first_letter_to_uppper_case("foobar".to_string());
or
let newfoo = "foobar".to_string().first_letter_to_uppper_case();

If you want a function used like so:
let newfoo = first_letter_to_uppper_case("foobar".to_string())
try the following:
fn main() {
println!("{}", first_letter_to_uppper_case("foobar".to_string()));
}
fn first_letter_to_uppper_case (s1: String) -> String {
let mut c = s1.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
If you want it as a method implemented on the String type, like let newfoo = "foobar".to_string().first_letter_to_uppper_case(), try:
pub trait FirstLetterToUppperCase {
fn first_letter_to_uppper_case(self) -> String;
}
impl FirstLetterToUppperCase for String {
fn first_letter_to_uppper_case(self) -> String {
let mut c = self.chars();
match c.next() {
None => String::new(),
Some(f) => f.to_uppercase().collect::<String>() + c.as_str(),
}
}
}
fn main() {
println!("{}", "foobar".to_string().first_letter_to_uppper_case());
}
However, these functions do not deal with non-ASCII text very well: they uppercase a single char, not a user-perceived character (grapheme cluster). For more information, see this answer.
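A variant that borrows its input may be convenient when the caller does not want to give up ownership of the String. This is only a sketch, with a hypothetical name capitalize_first. It relies on char::to_uppercase returning an iterator, because one char can uppercase to several chars:

```rust
/// Hypothetical helper: uppercase the first `char` of `s`, leaving the rest as is.
fn capitalize_first(s: &str) -> String {
    let mut chars = s.chars();
    match chars.next() {
        None => String::new(),
        // `to_uppercase` yields one or more chars; chain the untouched remainder.
        Some(first) => first.to_uppercase().chain(chars).collect(),
    }
}
```

For example, "ßeta" becomes "SSeta", since the uppercase mapping of 'ß' is the two-char sequence "SS".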

Is there a method like JavaScript's substr in Rust?

I looked at the Rust docs for String but I can't find a way to extract a substring.
Is there a method like JavaScript's substr in Rust? If not, how would you implement it?
str.substr(start[, length])
The closest is probably slice_unchecked but it uses byte offsets instead of character indexes and is marked unsafe.

For characters, you can use s.chars().skip(pos).take(len):
fn main() {
let s = "Hello, world!";
let ss: String = s.chars().skip(7).take(5).collect();
println!("{}", ss);
}
Beware of what counts as a character, though: chars() iterates over Unicode scalar values, not user-perceived characters (grapheme clusters).
For bytes, you can use the slice syntax:
fn main() {
let s = b"Hello, world!";
let ss = &s[7..12];
println!("{:?}", ss);
}

You can use the as_str method on the Chars iterator to get back a &str slice after you have advanced the iterator. So to skip the first start chars, you can call
let s = "Some text to slice into";
let mut iter = s.chars();
// eat up the first `start` chars (nth(start) on its own would consume start + 1)
if start > 0 { iter.nth(start - 1); }
let slice = iter.as_str(); // get back a slice of the rest of the iterator
Now if you also want to limit the length, you first need to figure out the byte position of the char at index length:
// if fewer than `length` chars remain, take everything that is left
let end_pos = slice.char_indices().nth(length).map(|(n, _)| n).unwrap_or(slice.len());
let substr = &slice[..end_pos];
This might feel a little roundabout, but Rust is not hiding anything from you that might take up CPU cycles. That said, I wonder why there's no crate yet that offers a substr method.
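The two steps can be wrapped in one function. A minimal sketch, assuming a hypothetical name substr_chars and assuming that out-of-range arguments should clamp to the end of the string rather than panic:

```rust
/// Hypothetical helper: `start` and `length` count chars, not bytes.
/// The result borrows from `s`, so nothing is allocated.
fn substr_chars(s: &str, start: usize, length: usize) -> &str {
    // Byte offset of the char at index `start`, or the end of the string.
    let byte_start = s.char_indices().nth(start).map_or(s.len(), |(i, _)| i);
    let rest = &s[byte_start..];
    // Byte offset (within `rest`) of the char at index `length`, or the end.
    let byte_end = rest.char_indices().nth(length).map_or(rest.len(), |(i, _)| i);
    &rest[..byte_end]
}
```

Both offsets come from char_indices, so they always fall on char boundaries and the slicing cannot panic.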

This code performs both substring-ing and string-slicing, without panicking or allocating:
use std::ops::{Bound, RangeBounds};
trait StringUtils {
fn substring(&self, start: usize, len: usize) -> &str;
fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}
impl StringUtils for str {
fn substring(&self, start: usize, len: usize) -> &str {
let mut char_pos = 0;
let mut byte_start = 0;
let mut it = self.chars();
loop {
if char_pos == start { break; }
if let Some(c) = it.next() {
char_pos += 1;
byte_start += c.len_utf8();
}
else { break; }
}
char_pos = 0;
let mut byte_end = byte_start;
loop {
if char_pos == len { break; }
if let Some(c) = it.next() {
char_pos += 1;
byte_end += c.len_utf8();
}
else { break; }
}
&self[byte_start..byte_end]
}
fn slice(&self, range: impl RangeBounds<usize>) -> &str {
let start = match range.start_bound() {
Bound::Included(bound) | Bound::Excluded(bound) => *bound,
Bound::Unbounded => 0,
};
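let end = match range.end_bound() {
Bound::Included(bound) => bound.saturating_add(1),
Bound::Excluded(bound) => *bound,
Bound::Unbounded => self.len(),
};
// saturating_sub avoids an underflow panic when the range ends before it starts
let len = end.saturating_sub(start);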
self.substring(start, len)
}
}
fn main() {
let s = "abcdèfghij";
// All three statements should print:
// "abcdè, abcdèfghij, dèfgh, dèfghij."
println!("{}, {}, {}, {}.",
s.substring(0, 5),
s.substring(0, 50),
s.substring(3, 5),
s.substring(3, 50));
println!("{}, {}, {}, {}.",
s.slice(..5),
s.slice(..50),
s.slice(3..8),
s.slice(3..));
println!("{}, {}, {}, {}.",
s.slice(..=4),
s.slice(..=49),
s.slice(3..=7),
s.slice(3..));
}

For my_string.substring(start, len)-like syntax, you can write a custom trait:
trait StringUtils {
fn substring(&self, start: usize, len: usize) -> Self;
}
impl StringUtils for String {
fn substring(&self, start: usize, len: usize) -> Self {
self.chars().skip(start).take(len).collect()
}
}
// Usage:
fn main() {
let phrase: String = "this is a string".to_string();
println!("{}", phrase.substring(5, 8)); // prints "is a str"
}

With a char_indices-based approach, the end of the string needs care: char_indices() never yields s.len(), so nth() returns None when the requested end is exactly one past the last char. Appending that position with .chain(once(s.len())) handles this case.
Here the function substr implements a substring slice with error handling. If an invalid index is passed to the function, the valid part of the string slice is returned in the Err variant. All corner cases should be handled correctly.
fn substr(s: &str, begin: usize, length: Option<usize>) -> Result<&str, &str> {
use std::iter::once;
let mut itr = s.char_indices().map(|(n, _)| n).chain(once(s.len()));
let beg = itr.nth(begin);
if beg.is_none() {
return Err("");
} else if length == Some(0) {
return Ok("");
}
let end = length.map_or(Some(s.len()), |l| itr.nth(l-1));
if let Some(end) = end {
return Ok(&s[beg.unwrap()..end]);
} else {
return Err(&s[beg.unwrap()..s.len()]);
}
}
let s = "abc🙂";
assert_eq!(Ok("bc"), substr(s, 1, Some(2)));
assert_eq!(Ok("c🙂"), substr(s, 2, Some(2)));
assert_eq!(Ok("c🙂"), substr(s, 2, None));
assert_eq!(Err("c🙂"), substr(s, 2, Some(99)));
assert_eq!(Ok(""), substr(s, 2, Some(0)));
assert_eq!(Err(""), substr(s, 5, Some(4)));
Note that this does not handle Unicode grapheme clusters. For example, "y̆es" contains 4 Unicode chars but 3 grapheme clusters. The unicode-segmentation crate solves this problem: grapheme clusters are handled correctly if the part
let mut itr = s.char_indices()...
is replaced with
use unicode_segmentation::UnicodeSegmentation;
let mut itr = s.grapheme_indices(true)...
Then the following also works:
assert_eq!(Ok("y̆"), substr("y̆es", 0, Some(1)));

Knowing about the various syntaxes of the slice type might be beneficial for some of the readers.
A reference to part of a string (the indices are byte offsets):
&s[6..11]
If you start at index 0, you can omit the start value:
&s[0..1] is equivalent to &s[..1]
Likewise, if the substring runs to the last byte of the string, you can omit the end value:
&s[3..s.len()] is equivalent to &s[3..]
Both can be omitted when the slice encompasses the entire string:
&s[..]
You can also use the inclusive range operator to include the end index:
&s[..=1]
Link to docs: https://doc.rust-lang.org/book/ch04-03-slices.html
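Because these ranges are byte offsets, indexing panics when an offset is out of bounds or falls inside a multi-byte character. A small sketch of a non-panicking alternative built on str::get, with a hypothetical wrapper name try_slice:

```rust
/// Hypothetical helper: byte-range slicing that returns `None` instead of
/// panicking when the range is out of bounds or splits a multi-byte char.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

In "héllo" the 'é' occupies bytes 1 and 2, so try_slice("héllo", 1, 3) returns Some("é"), while try_slice("héllo", 1, 2) returns None.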

I would suggest you use the crate substring. (And look at its source code if you want to learn how to do this properly.)

I couldn't find the exact substr implementation that I'm familiar with from other programming languages such as JavaScript and Dart.
Here is a possible implementation of a substr method for &str (for a String, call it on as_str()).
Let's define a trait so that we can add methods to existing types (like extensions in Dart).
trait Substr {
fn substr(&self, start: usize, end: usize) -> String;
}
Then implement this trait for &str
impl<'a> Substr for &'a str {
fn substr(&self, start: usize, end: usize) -> String {
if start >= end {
return String::new();
}
self.chars().skip(start).take(end - start).collect()
}
}
Try:
fn main() {
let string = "Hello, world!";
let substring = string.substr(0, 4);
println!("{}", substring); // Hell
}

You can also use .to_string()[ <range> ].
This example takes a slice of a copy of the original string (to_string() allocates a new String), then mutates the original to demonstrate that the slice is unaffected.
let mut s: String = "Hello, world!".to_string();
let substring: &str = &s.to_string()[..6];
s.replace_range(..6, "Goodbye,");
println!("{} {} universe!", s, substring);
// Goodbye, world! Hello, universe!

I'm not very experienced in Rust but I gave it a try. If someone could correct my answer please don't hesitate.
fn substring(string:String, start:u32, end:u32) -> String {
let mut substr = String::new();
let mut i = start;
while i < end + 1 {
substr.push_str(&*(string.chars().nth(i as usize).unwrap().to_string()));
i += 1;
}
return substr;
}
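The function above rescans the string from the start for every character and panics if end is past the last char. A sketch of one possible correction that keeps the inclusive end index, using a hypothetical name substring_inclusive:

```rust
/// Hypothetical helper: chars from index `start` through index `end`, inclusive.
/// Out-of-range indices are clamped instead of panicking.
fn substring_inclusive(s: &str, start: usize, end: usize) -> String {
    if end < start {
        return String::new();
    }
    // Number of chars to take; saturating_add guards against usize overflow.
    let count = (end - start).saturating_add(1);
    s.chars().skip(start).take(count).collect()
}
```

This walks the string once, and taking a &str instead of a String lets callers keep ownership of their input.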

How to implement trim for Vec<u8>?

Rust provides a trim method for strings: str.trim() removing leading and trailing whitespace. I want to have a method that does the same for bytestrings. It should take a Vec<u8> and remove leading and trailing whitespace (space, 0x20 and htab, 0x09).
Writing a trim_left() is easy: you can just use an iterator with skip_while():
fn main() {
let a: &[u8] = b" fo o ";
let b: Vec<u8> = a.iter().map(|x| x.clone()).skip_while(|x| x == &0x20 || x == &0x09).collect();
println!("{:?}", b);
}
But to trim trailing whitespace, I would need to look ahead to check that no non-whitespace byte follows once whitespace has been found.

Here's an implementation that returns a slice, rather than a new Vec<u8>, as str::trim() does. It's also implemented on [u8], since that's more general than Vec<u8> (you can obtain a slice from a vector cheaply, but creating a vector from a slice is more costly, since it involves a heap allocation and a copy).
trait SliceExt {
fn trim(&self) -> &Self;
}
impl SliceExt for [u8] {
fn trim(&self) -> &[u8] {
fn is_whitespace(c: &u8) -> bool {
*c == b'\t' || *c == b' '
}
fn is_not_whitespace(c: &u8) -> bool {
!is_whitespace(c)
}
if let Some(first) = self.iter().position(is_not_whitespace) {
if let Some(last) = self.iter().rposition(is_not_whitespace) {
&self[first..last + 1]
} else {
unreachable!();
}
} else {
&[]
}
}
}
fn main() {
let a = b" fo o ";
let b = a.trim();
println!("{:?}", b);
}
If you really need a Vec<u8> after the trim(), you can just call into() on the slice to turn it into a Vec<u8>.
fn main() {
let a = b" fo o ";
let b: Vec<u8> = a.trim().into();
println!("{:?}", b);
}

This is a much simpler version than the other answers.
pub fn trim_ascii_whitespace(x: &[u8]) -> &[u8] {
let from = match x.iter().position(|x| !x.is_ascii_whitespace()) {
Some(i) => i,
None => return &x[0..0],
};
let to = x.iter().rposition(|x| !x.is_ascii_whitespace()).unwrap();
&x[from..=to]
}
Weird that this isn't in the standard library. I would have thought it was a common task.
Anyway here it is as a complete file/trait (with tests!) that you can copy/paste.
use std::ops::Deref;
/// Trait to allow trimming ascii whitespace from a &[u8].
pub trait TrimAsciiWhitespace {
/// Trim ascii whitespace (based on `is_ascii_whitespace()`) from the
/// start and end of a slice.
fn trim_ascii_whitespace(&self) -> &[u8];
}
impl<T: Deref<Target=[u8]>> TrimAsciiWhitespace for T {
fn trim_ascii_whitespace(&self) -> &[u8] {
let from = match self.iter().position(|x| !x.is_ascii_whitespace()) {
Some(i) => i,
None => return &self[0..0],
};
let to = self.iter().rposition(|x| !x.is_ascii_whitespace()).unwrap();
&self[from..=to]
}
}
#[cfg(test)]
mod test {
use super::TrimAsciiWhitespace;
#[test]
fn basic_trimming() {
assert_eq!(b" A ".trim_ascii_whitespace(), b"A");
assert_eq!(b" AB ".trim_ascii_whitespace(), b"AB");
assert_eq!(b"A ".trim_ascii_whitespace(), b"A");
assert_eq!(b"AB ".trim_ascii_whitespace(), b"AB");
assert_eq!(b" A".trim_ascii_whitespace(), b"A");
assert_eq!(b" AB".trim_ascii_whitespace(), b"AB");
assert_eq!(b" A B ".trim_ascii_whitespace(), b"A B");
assert_eq!(b"A B ".trim_ascii_whitespace(), b"A B");
assert_eq!(b" A B".trim_ascii_whitespace(), b"A B");
assert_eq!(b" ".trim_ascii_whitespace(), b"");
assert_eq!(b" ".trim_ascii_whitespace(), b"");
}
}
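Another way to write the same trim is with slice patterns, peeling bytes off each end until a non-whitespace byte is reached. This is a sketch with a hypothetical name trim_bytes, and it treats only space and tab as whitespace, as the question asks. As far as I am aware, newer standard library versions also provide a trim_ascii method on byte slices.

```rust
/// Hypothetical helper: strip leading and trailing spaces and tabs.
fn trim_bytes(mut bytes: &[u8]) -> &[u8] {
    fn is_blank(b: &u8) -> bool {
        matches!(*b, b' ' | b'\t')
    }
    // Drop blank bytes from the front.
    while let [first, rest @ ..] = bytes {
        if is_blank(first) {
            bytes = rest;
        } else {
            break;
        }
    }
    // Drop blank bytes from the back.
    while let [rest @ .., last] = bytes {
        if is_blank(last) {
            bytes = rest;
        } else {
            break;
        }
    }
    bytes
}
```

An all-whitespace input shrinks to the empty slice in the first loop, so no special case is needed.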

All we have to do is find the index of the first non-whitespace character, one time counting forward from the start, and another time counting backwards from the end.
fn is_not_whitespace(e: &u8) -> bool {
*e != 0x20 && *e != 0x09
}
fn main() {
let a: &[u8] = b" fo o ";
// find the index of first non-whitespace char
let begin = a.iter()
.position(is_not_whitespace);
// find the index of the last non-whitespace char
let end = a.iter()
.rev()
.position(is_not_whitespace)
.map(|j| a.len() - j);
// build it
let vec = begin.and_then(|i| end.map(|j| a[i..j].iter().collect()))
.unwrap_or(Vec::new());
println!("{:?}", vec);
}

Using the same iterator multiple times in Rust

Editor's note: This code example is from a version of Rust prior to 1.0 when many iterators implemented Copy. Updated versions of this code produce different errors, but the answers still contain valuable information.
I'm trying to write a function to split a string into clumps of letters and numbers; for example, "test123test" would turn into [ "test", "123", "test" ]. Here's my attempt so far:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek() {
None => return bits,
Some(c) => if c.is_digit() {
bits.push(iter.take_while(|c| c.is_digit()).collect());
} else {
bits.push(iter.take_while(|c| !c.is_digit()).collect());
}
}
}
return bits;
}
However, this doesn't work: it loops forever. It seems that it is using a clone of iter each time I call take_while, starting from the same position over and over again. I would like it to use the same iter each time, advancing the same iterator across all the take_while calls. Is this possible?

As you identified, each take_while call is duplicating iter, since take_while takes self and the Peekable chars iterator is Copy. (Only true before Rust 1.0 — editor)
You want to be modifying the iterator each time, that is, for take_while to be operating on an &mut to your iterator. Which is exactly what the .by_ref adaptor is for:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek().map(|c| *c) {
None => return bits,
Some(c) => if c.is_digit(10) {
bits.push(iter.by_ref().take_while(|c| c.is_digit(10)).collect());
} else {
bits.push(iter.by_ref().take_while(|c| !c.is_digit(10)).collect());
},
}
}
}
fn main() {
println!("{:?}", split("123abc456def"))
}
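This prints
["123", "bc", "56", "ef"]
However, this is not correct: take_while also consumes the first element that fails the predicate, so the first character of each following run is lost.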
I would actually recommend writing this as a normal for loop, using the char_indices iterator:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(input[start..i].to_string());
is_digit = this_is_digit;
start = i;
}
}
bits.push(input[start..].to_string());
bits
}
This form also allows for doing this with much fewer allocations (that is, the Strings are not required), because each returned value is just a slice into the input, and we can use lifetimes to state this:
pub fn split<'a>(input: &'a str) -> Vec<&'a str> {
let mut bits = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(&input[start..i]);
is_digit = this_is_digit;
start = i;
}
}
bits.push(&input[start..]);
bits
}
All that changed was the type signature, removing the Vec<String> type hint and the .to_string calls.
One could even write an iterator like this, to avoid having to allocate the Vec. Something like fn split<'a>(input: &'a str) -> Splits<'a> { /* construct a Splits */ } where Splits is a struct that implements Iterator<Item = &'a str>.
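A minimal sketch of what such a Splits iterator might look like. The struct layout (a single rest field holding the unconsumed input) is my assumption, not something the answer specifies:

```rust
/// Hypothetical iterator over alternating runs of digits and non-digits.
pub struct Splits<'a> {
    /// The part of the input that has not been yielded yet.
    rest: &'a str,
}

pub fn split<'a>(input: &'a str) -> Splits<'a> {
    Splits { rest: input }
}

impl<'a> Iterator for Splits<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        // An empty remainder means the iteration is finished.
        let first = self.rest.chars().next()?;
        let is_digit = first.is_digit(10);
        // Byte offset of the first char of the other kind, or the end of the input.
        let end = self
            .rest
            .char_indices()
            .find(|&(_, c)| c.is_digit(10) != is_digit)
            .map_or(self.rest.len(), |(i, _)| i);
        let (run, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some(run)
    }
}
```

Callers that still want a Vec can write split(input).collect::<Vec<_>>(); others can consume the runs lazily.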

take_while takes self by value: it consumes the iterator. Before Rust 1.0 the iterator could unfortunately also be copied implicitly, leading to the surprising behaviour that you are observing.
For these reasons, take_while cannot do what you want here. You will need to unroll the take_while calls manually.
Here is one of many possible ways of dealing with this:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
let seeking_digits = match iter.peek() {
None => return bits,
Some(c) => c.is_digit(10),
};
if seeking_digits {
bits.push(take_while(&mut iter, |c| c.is_digit(10)));
} else {
bits.push(take_while(&mut iter, |c| !c.is_digit(10)));
}
}
}
fn take_while<I, F>(iter: &mut std::iter::Peekable<I>, predicate: F) -> String
where
I: Iterator<Item = char>,
F: Fn(&char) -> bool,
{
let mut out = String::new();
loop {
match iter.peek() {
Some(c) if predicate(c) => out.push(*c),
_ => return out,
}
let _ = iter.next();
}
}
fn main() {
println!("{:?}", split("test123test"));
}
This yields a solution with two levels of looping; another valid approach would be to model it as a state machine one level deep only. Ask if you aren’t sure what I mean and I’ll demonstrate.
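One way the single-level state machine could look, as a sketch under my own assumptions (the name split_state_machine and the Option<bool> state are not from the answer). The state records whether the previous char was a digit, and a new run starts whenever that changes:

```rust
/// Hypothetical single-loop version: the state is the kind of the previous char.
pub fn split_state_machine(input: &str) -> Vec<String> {
    let mut bits: Vec<String> = Vec::new();
    // None before the first char; Some(true) while inside a run of digits.
    let mut prev_is_digit: Option<bool> = None;
    for c in input.chars() {
        let is_digit = c.is_digit(10);
        if prev_is_digit != Some(is_digit) {
            // State transition: start a new run.
            bits.push(String::new());
        }
        if let Some(run) = bits.last_mut() {
            run.push(c);
        }
        prev_is_digit = Some(is_digit);
    }
    bits
}
```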
