I looked at the Rust docs for String but I can't find a way to extract a substring.
Is there a method like JavaScript's substr in Rust? If not, how would you implement it?
str.substr(start[, length])
The closest is probably slice_unchecked but it uses byte offsets instead of character indexes and is marked unsafe.
For characters, you can use s.chars().skip(pos).take(len):
fn main() {
let s = "Hello, world!";
let ss: String = s.chars().skip(7).take(5).collect();
println!("{}", ss);
}
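Beware of the definition of Unicode characters though: chars() iterates over Unicode scalar values, which do not always match what a reader perceives as a single character.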
For bytes, you can use the slice syntax:
fn main() {
let s = b"Hello, world!";
let ss = &s[7..12];
println!("{:?}", ss);
}
You can use the as_str method on the Chars iterator to get back a &str slice after you have advanced the iterator. So to skip the first start chars, you can call
let s = "Some text to slice into";
let mut iter = s.chars();
if start > 0 { iter.nth(start - 1); } // eat up start values (nth(n) consumes n + 1 items)
let slice = iter.as_str(); // get back a slice of the rest of the iterator
Now if you also want to limit the length, you first need to figure out the byte position of the character at index length:
let end_pos = slice.char_indices().nth(length).map(|(n, _)| n).unwrap_or(0);
let substr = &slice[..end_pos];
This might feel a little roundabout, but Rust is not hiding anything from you that might take up CPU cycles. That said, I wonder why there's no crate yet that offers a substr method.
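The two steps above (skip start chars, then find the byte position where the length ends) can be folded into one helper. This is only a minimal sketch: substr_chars is a hypothetical name, not a standard-library method, and it clamps to the end of the string instead of reporting an error.

```rust
/// Returns the substring that begins `start` chars into `s` and is at most
/// `len` chars long, as a borrowed slice (no allocation).
/// `substr_chars` is a hypothetical helper, not part of the standard library.
fn substr_chars(s: &str, start: usize, len: usize) -> &str {
    // Byte offset of the char at index `start`, or the end of `s` if it is shorter.
    let begin = s.char_indices().nth(start).map_or(s.len(), |(i, _)| i);
    let rest = &s[begin..];
    // Byte offset within `rest` of the char at index `len`, or all of `rest`.
    let end = rest.char_indices().nth(len).map_or(rest.len(), |(i, _)| i);
    &rest[..end]
}
```

With this sketch, substr_chars("Hello, world!", 7, 5) yields "world", and a length that runs past the end simply returns the remainder.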
This code performs both substring-ing and string-slicing, without panicking or allocating:
use std::ops::{Bound, RangeBounds};
trait StringUtils {
fn substring(&self, start: usize, len: usize) -> &str;
fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}
impl StringUtils for str {
fn substring(&self, start: usize, len: usize) -> &str {
let mut char_pos = 0;
let mut byte_start = 0;
let mut it = self.chars();
loop {
if char_pos == start { break; }
if let Some(c) = it.next() {
char_pos += 1;
byte_start += c.len_utf8();
}
else { break; }
}
char_pos = 0;
let mut byte_end = byte_start;
loop {
if char_pos == len { break; }
if let Some(c) = it.next() {
char_pos += 1;
byte_end += c.len_utf8();
}
else { break; }
}
&self[byte_start..byte_end]
}
fn slice(&self, range: impl RangeBounds<usize>) -> &str {
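let start = match range.start_bound() {
Bound::Included(bound) => *bound,
// an excluded start bound begins one position later
Bound::Excluded(bound) => bound.saturating_add(1),
Bound::Unbounded => 0,
};
let end = match range.end_bound() {
Bound::Included(bound) => bound.saturating_add(1),
Bound::Excluded(bound) => *bound,
Bound::Unbounded => self.len(),
};
// saturating_sub keeps a reversed range such as 5..3 from underflowing
self.substring(start, end.saturating_sub(start))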
}
}
fn main() {
let s = "abcdèfghij";
// All three statements should print:
// "abcdè, abcdèfghij, dèfgh, dèfghij."
println!("{}, {}, {}, {}.",
s.substring(0, 5),
s.substring(0, 50),
s.substring(3, 5),
s.substring(3, 50));
println!("{}, {}, {}, {}.",
s.slice(..5),
s.slice(..50),
s.slice(3..8),
s.slice(3..));
println!("{}, {}, {}, {}.",
s.slice(..=4),
s.slice(..=49),
s.slice(3..=7),
s.slice(3..));
}
For my_string.substring(start, len)-like syntax, you can write a custom trait:
trait StringUtils {
fn substring(&self, start: usize, len: usize) -> Self;
}
impl StringUtils for String {
fn substring(&self, start: usize, len: usize) -> Self {
self.chars().skip(start).take(len).collect()
}
}
// Usage:
fn main() {
let phrase: String = "this is a string".to_string();
println!("{}", phrase.substring(5, 8)); // prints "is a str"
}
The solution given by oli_obk does not handle the last index of the string slice. It can be fixed with .chain(once(s.len())).
Here, the function substr implements a substring slice with error handling. If an invalid index is passed to the function, the valid part of the string slice is returned in the Err variant. All corner cases should be handled correctly.
fn substr(s: &str, begin: usize, length: Option<usize>) -> Result<&str, &str> {
use std::iter::once;
let mut itr = s.char_indices().map(|(n, _)| n).chain(once(s.len()));
let beg = itr.nth(begin);
if beg.is_none() {
return Err("");
} else if length == Some(0) {
return Ok("");
}
let end = length.map_or(Some(s.len()), |l| itr.nth(l-1));
if let Some(end) = end {
return Ok(&s[beg.unwrap()..end]);
} else {
return Err(&s[beg.unwrap()..s.len()]);
}
}
let s = "abc🙂";
assert_eq!(Ok("bc"), substr(s, 1, Some(2)));
assert_eq!(Ok("c🙂"), substr(s, 2, Some(2)));
assert_eq!(Ok("c🙂"), substr(s, 2, None));
assert_eq!(Err("c🙂"), substr(s, 2, Some(99)));
assert_eq!(Ok(""), substr(s, 2, Some(0)));
assert_eq!(Err(""), substr(s, 5, Some(4)));
Note that this does not handle Unicode grapheme clusters. For example, "y̆es" contains 4 Unicode chars but 3 grapheme clusters. The unicode-segmentation crate solves this problem. Grapheme clusters are handled correctly if the part
let mut itr = s.char_indices()...
is replaced with
use unicode_segmentation::UnicodeSegmentation;
let mut itr = s.grapheme_indices(true)...
Then the following also works
assert_eq!(Ok("y̆"), substr("y̆es", 0, Some(1)));
Knowing about the various syntaxes of the slice type might be beneficial for some of the readers.
Reference to a part of a string
&s[6..11]
If you start at index 0, you can omit the value
&s[0..1] is equivalent to &s[..1]
Likewise, you can omit the end index if your substring runs through the last byte of the string
&s[3..s.len()] is equivalent to &s[3..]
This also applies when the slice encompasses the entire string
&s[..]
You can also use the range inclusive operator to include the last value
&s[..=1]
Link to docs: https://doc.rust-lang.org/book/ch04-03-slices.html
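One caveat with all of these forms: the range indices are byte offsets, and indexing panics if an offset falls inside a multi-byte character or past the end. A minimal sketch of the non-panicking alternative, str::get, wrapped in a hypothetical helper named byte_slice:

```rust
/// Returns the byte range `start..end` of `s`, or `None` if the range is out
/// of bounds or would cut a multi-byte character in half.
/// `byte_slice` is a hypothetical wrapper around the standard `str::get`.
fn byte_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}
```

For "héllo", where é occupies bytes 1 and 2, byte_slice("héllo", 0, 3) gives Some("hé") while byte_slice("héllo", 0, 2) gives None, whereas &"héllo"[0..2] would panic.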
I would suggest you use the crate substring. (And look at its source code if you want to learn how to do this properly.)
I couldn't find an exact equivalent of the substr that I'm familiar with from other programming languages such as JavaScript and Dart.
Here is a possible implementation of a substr method for &str and String.
Let's define a trait so that we can add methods to built-in types (like extensions in Dart).
trait Substr {
fn substr(&self, start: usize, end: usize) -> String;
}
Then implement this trait for &str
impl<'a> Substr for &'a str {
fn substr(&self, start: usize, end: usize) -> String {
if start >= end {
return String::new();
}
self.chars().skip(start).take(end - start).collect()
}
}
Try:
fn main() {
let string = "Hello, world!";
let substring = string.substr(0, 4);
println!("{}", substring); // Hell
}
You can also use .to_string()[ <range> ].
This example takes a slice of a copy of the original string (made by to_string()), then mutates the original to demonstrate that the slice is unaffected.
let mut s: String = "Hello, world!".to_string();
let substring: &str = &s.to_string()[..6];
s.replace_range(..6, "Goodbye,");
println!("{} {} universe!", s, substring);
// Goodbye, world! Hello, universe!
I'm not very experienced in Rust but I gave it a try. If someone could correct my answer please don't hesitate.
fn substring(string:String, start:u32, end:u32) -> String {
let mut substr = String::new();
let mut i = start;
while i < end + 1 {
substr.push_str(&*(string.chars().nth(i as usize).unwrap().to_string()));
i += 1;
}
return substr;
}
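A few things stand out in the function above: it takes the String by value, chars().nth(i) rescans the string from the start on every iteration, and unwrap() panics once i runs past the last char. One possible correction is sketched below, keeping the inclusive end index; substring_inclusive is a hypothetical name, and clamping to the end of the string is an assumption about the desired behaviour.

```rust
/// Returns the chars of `s` from index `start` up to and including index `end`.
/// Indices count chars, not bytes. Out-of-range indices shorten the result
/// instead of panicking, and a reversed range yields an empty string.
fn substring_inclusive(s: &str, start: usize, end: usize) -> String {
    if end < start {
        return String::new();
    }
    // `end` is inclusive, so the number of chars to take is end - start + 1.
    let count = (end - start).saturating_add(1);
    s.chars().skip(start).take(count).collect()
}
```

The string is walked once, so the cost is linear in its length.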
When writing a parser I ran into the problem that there are two string slices that come from the same origin string and are next to each other in memory. Of course it would be possible to simply copy the strings and merge them back into one, but that would require unnecessary computational resources. Is there a clean way to solve this in rust without unsafe code? For better illustration, here is an example of how I would like to solve it:
fn main() {
//This is the owned string.
//(Of course, this is also just a slice of a static string, but that makes no difference here).
let origin: &str = "Hello World";
//Substrings which borrow data from the original and should be adjacent in memory
let a: &str = &origin[0..5];
let b: &str = &origin[5..11];
//If the representation of a and b on the stack is:
// a: { ptr: PointerA, len: LenA }
// b: { ptr: PointerB, len: LenB }
//Then PointerA + LenA should be PointerB
//From this I conclude that there must be a way to combine these strings into c,
//which in turn would have this representation on the stack:
// c: { ptr: PointerA, len: LenA + LenB }
//The merge method doesn't actually exist; it's just an example of how I would imagine the API to look.
let c = a.merge(b).unwrap();
assert!(c == origin)
}
Contrast this with the more inefficient current solution:
fn main() {
let origin: &str = "Hello World";
let a: &str = &origin[0..5];
let b: &str = &origin[5..11];
//Here both strings are simply copied to another location in the heap
//and need unnecessarily more memory, because the stored data exactly matches the data in origin
let c = a.to_owned() + b;
assert!(c == origin)
}
EDIT: This is an example of how I would implement this with unsafe code, but I really don't know if it is actually safe
fn main() {
let origin: &str = "Hello World";
let a: &str = &origin[0..5];
let b: &str = &origin[5..11];
let c = merge(a, b).unwrap();
assert!(c == origin)
}
fn merge<'a>(one: &'a str, two: &'a str) -> Option<&'a str> {
    // Assumes a &str is laid out as [pointer, length]; Rust does not guarantee this layout.
    unsafe {
        let one: [usize; 2] = std::mem::transmute(one);
        let two: [usize; 2] = std::mem::transmute(two);
        if let Some(len) = one[1].checked_add(two[1]) {
            // Only merge when `two` starts exactly where `one` ends.
            if one[0] + one[1] == two[0] {
                Some(std::mem::transmute([one[0], len]))
            } else {
                None
            }
        } else {
            None
        }
    }
}
You can chain the characters of the two split strings:
fn main() {
//This is the owned string.
//(Of course, this is also just a slice of a static string, but that makes no difference here).
let origin: &str = "Hello World";
//Substrings which borrow data from the original and should be adjacent in memory
let a: &str = &origin[0..5];
let b: &str = &origin[5..11];
let c = a.chars().chain(b.chars());
assert_eq!(c.collect::<String>().as_str(), origin);
}
But notice that most operations requiring a &str would force you to create a new string anyway.
So that brings us to unsafe, and as pointed out by #chayimfriedman it is UB:
use std::{slice, str};
fn main() {
//This is the owned string.
//(Of course, this is also just a slice of a static string, but that makes no difference here).
let origin: &str = "Hello World";
//Substrings which borrow data from the original and should be adjacent in memory
let a: &str = &origin[0..5];
let b: &str = &origin[5..11];
let c: &str = merge(a, b).unwrap();
assert_eq!(c, origin);
}
fn merge<'a>(a: &'a str, b: &'a str) -> Result<&'a str, String> {
let a_len = a.len();
let a_ptr = a.as_ptr();
let b_ptr = b.as_ptr();
let b_len = b.len();
if a_ptr as usize + a_len != b_ptr as usize {
return Err("Strings are not adjacent".to_string());
}
Ok(unsafe {
str::from_utf8_unchecked(slice::from_raw_parts(a_ptr as *const u8, a_len + b_len))
})
}
Note that eventually you could avoid the pointer casts and use ptr::addr instead, which as of Rust 1.62 is still nightly-only.
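If the origin string is still in reach (as it is in the parser example from the question), the same result can be had without any unsafe code by re-slicing origin rather than gluing the two halves together. This is a sketch under that assumption; merge_in is a hypothetical name, and the pointer-to-usize casts are only used to compute offsets, never to build a reference.

```rust
/// Re-slices `origin` to cover both `a` and `b`, provided both lie inside
/// `origin` and `b` starts exactly where `a` ends. Returns `None` otherwise.
/// `merge_in` is a hypothetical helper, not a standard-library method.
fn merge_in<'a>(origin: &'a str, a: &str, b: &str) -> Option<&'a str> {
    let base = origin.as_ptr() as usize;
    // Offsets of `a` and `b` relative to `origin`; `checked_sub` rejects
    // slices that start before `origin` does.
    let a_start = (a.as_ptr() as usize).checked_sub(base)?;
    let b_start = (b.as_ptr() as usize).checked_sub(base)?;
    let a_end = a_start.checked_add(a.len())?;
    let b_end = b_start.checked_add(b.len())?;
    if a_end != b_start || b_end > origin.len() {
        return None;
    }
    // `get` re-checks the bounds and that both ends sit on char boundaries.
    origin.get(a_start..b_end)
}
```

Because the returned slice is always produced by origin.get(..), a wrong guess about a and b can at worst yield None or an unexpected but valid slice of origin, never undefined behaviour.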
How can I get rid of the ugly let mut i = 0; and i += 1;? Is there a more idiomatic way I can write this loop? I tried .enumerate() but it doesn't work on a &[&str].
use std::fmt::Write;
pub fn build_proverb(list: &[&str]) -> String {
let mut proverb = String::new();
let mut i = 0;
for word in list {
i += 1;
if i < list.len() {
write!(proverb, "For want of a {} the {} was lost.\n", word, list[i]);
} else {
write!(proverb, "And all for the want of a {}.", list[0]);
}
}
proverb
}
You were close, enumerate() doesn't exist for &[&str] because enumerate() is a function on Iterator, and &[&str] is not an Iterator.
But you can call list.iter() to get an iterator, which you can then call enumerate() on.
Full example:
use std::fmt::Write;
pub fn build_proverb(list: &[&str]) -> String {
    let mut proverb = String::new();
    for (i, word) in list.iter().enumerate() {
        // enumerate() starts at 0, so the word that follows is at i + 1
        if i + 1 < list.len() {
            write!(proverb, "For want of a {} the {} was lost.\n", word, list[i + 1]).unwrap();
        } else {
            write!(proverb, "And all for the want of a {}.", list[0]).unwrap();
        }
    }
    proverb
}
The fact you have word and list[i] (after incrementing i) makes it look like you want something like windows:
use std::fmt::Write;
pub fn build_proverb(list: &[&str]) -> String {
    let mut proverb = String::new();
    for words in list.windows(2) {
        write!(proverb, "For want of a {} the {} was lost.\n", words[0], words[1]).unwrap();
    }
    // first() avoids the panic that list[0] would cause on an empty list
    if let Some(first) = list.first() {
        write!(proverb, "And all for the want of a {}.", first).unwrap();
    }
    proverb
}
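The same pairing of each word with its successor can also be written with zip instead of windows. This is just an alternative sketch of the idea above, not a claim that it is better; it returns an empty string for an empty list, which is an assumption about the desired behaviour.

```rust
use std::fmt::Write;

/// Builds the proverb by pairing each word with the word that follows it.
pub fn build_proverb(list: &[&str]) -> String {
    let mut proverb = String::new();
    // `zip` stops at the shorter iterator, so this yields len - 1 pairs
    // (and none at all for an empty or one-element list).
    for (word, next) in list.iter().zip(list.iter().skip(1)) {
        // Writing to a `String` cannot fail, so the `Result` is discarded.
        let _ = writeln!(proverb, "For want of a {} the {} was lost.", word, next);
    }
    if let Some(first) = list.first() {
        let _ = write!(proverb, "And all for the want of a {}.", first);
    }
    proverb
}
```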
Rust provides a trim method for strings: str.trim() removing leading and trailing whitespace. I want to have a method that does the same for bytestrings. It should take a Vec<u8> and remove leading and trailing whitespace (space, 0x20 and htab, 0x09).
Writing a trim_left() is easy, you can just use an iterator with skip_while():
fn main() {
let a: &[u8] = b" fo o ";
let b: Vec<u8> = a.iter().copied().skip_while(|x| *x == 0x20 || *x == 0x09).collect();
println!("{:?}", b);
}
But to trim the right characters I would need to look ahead if no other letter is in the list after whitespace was found.
Here's an implementation that returns a slice, rather than a new Vec<u8>, as str::trim() does. It's also implemented on [u8], since that's more general than Vec<u8> (you can obtain a slice from a vector cheaply, but creating a vector from a slice is more costly, since it involves a heap allocation and a copy).
trait SliceExt {
fn trim(&self) -> &Self;
}
impl SliceExt for [u8] {
fn trim(&self) -> &[u8] {
fn is_whitespace(c: &u8) -> bool {
*c == b'\t' || *c == b' '
}
fn is_not_whitespace(c: &u8) -> bool {
!is_whitespace(c)
}
if let Some(first) = self.iter().position(is_not_whitespace) {
if let Some(last) = self.iter().rposition(is_not_whitespace) {
&self[first..last + 1]
} else {
unreachable!();
}
} else {
&[]
}
}
}
fn main() {
let a = b" fo o ";
let b = a.trim();
println!("{:?}", b);
}
If you really need a Vec<u8> after the trim(), you can just call into() on the slice to turn it into a Vec<u8>.
fn main() {
let a = b" fo o ";
let b: Vec<u8> = a.trim().into();
println!("{:?}", b);
}
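Another way to get the same slice-returning behaviour is to peel bytes off either end with slice patterns. The sketch below uses free functions with hypothetical names, and it relies on u8::is_ascii_whitespace, which also treats line feed, form feed and carriage return as whitespace (a wider set than the space and tab asked for in the question).

```rust
/// Strips leading ASCII whitespace by peeling bytes off the front.
fn trim_start(mut bytes: &[u8]) -> &[u8] {
    while let [first, rest @ ..] = bytes {
        if first.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    bytes
}

/// Strips trailing ASCII whitespace by peeling bytes off the back.
fn trim_end(mut bytes: &[u8]) -> &[u8] {
    while let [rest @ .., last] = bytes {
        if last.is_ascii_whitespace() {
            bytes = rest;
        } else {
            break;
        }
    }
    bytes
}

/// Strips ASCII whitespace from both ends without allocating.
fn trim(bytes: &[u8]) -> &[u8] {
    trim_end(trim_start(bytes))
}
```

Splitting the work into trim_start and trim_end also answers the look-ahead concern from the question: trimming the right side never has to scan forward, it only inspects the last byte.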
This is a much simpler version than the other answers.
pub fn trim_ascii_whitespace(x: &[u8]) -> &[u8] {
let from = match x.iter().position(|x| !x.is_ascii_whitespace()) {
Some(i) => i,
None => return &x[0..0],
};
let to = x.iter().rposition(|x| !x.is_ascii_whitespace()).unwrap();
&x[from..=to]
}
Weird that this isn't in the standard library. I would have thought it was a common task.
Anyway here it is as a complete file/trait (with tests!) that you can copy/paste.
/// Trait to allow trimming ascii whitespace from anything viewable as a &[u8].
pub trait TrimAsciiWhitespace {
    /// Trim ascii whitespace (based on `is_ascii_whitespace()`) from the
    /// start and end of a slice.
    fn trim_ascii_whitespace(&self) -> &[u8];
}
// AsRef<[u8]> (rather than Deref) also covers byte-string literals such as
// b" A ", whose type is a reference to a fixed-size array.
impl<T: AsRef<[u8]> + ?Sized> TrimAsciiWhitespace for T {
    fn trim_ascii_whitespace(&self) -> &[u8] {
        let bytes = self.as_ref();
        let from = match bytes.iter().position(|x| !x.is_ascii_whitespace()) {
            Some(i) => i,
            None => return &bytes[0..0],
        };
        let to = bytes.iter().rposition(|x| !x.is_ascii_whitespace()).unwrap();
        &bytes[from..=to]
    }
}
#[cfg(test)]
mod test {
use super::TrimAsciiWhitespace;
#[test]
fn basic_trimming() {
assert_eq!(b" A ".trim_ascii_whitespace(), b"A");
assert_eq!(b" AB ".trim_ascii_whitespace(), b"AB");
assert_eq!(b"A ".trim_ascii_whitespace(), b"A");
assert_eq!(b"AB ".trim_ascii_whitespace(), b"AB");
assert_eq!(b" A".trim_ascii_whitespace(), b"A");
assert_eq!(b" AB".trim_ascii_whitespace(), b"AB");
assert_eq!(b" A B ".trim_ascii_whitespace(), b"A B");
assert_eq!(b"A B ".trim_ascii_whitespace(), b"A B");
assert_eq!(b" A B".trim_ascii_whitespace(), b"A B");
assert_eq!(b" ".trim_ascii_whitespace(), b"");
assert_eq!(b" ".trim_ascii_whitespace(), b"");
}
}
All we have to do is find the index of the first non-whitespace character, one time counting forward from the start, and another time counting backwards from the end.
fn is_not_whitespace(e: &u8) -> bool {
*e != 0x20 && *e != 0x09
}
fn main() {
let a: &[u8] = b" fo o ";
// find the index of first non-whitespace char
let begin = a.iter()
.position(is_not_whitespace);
// find the index of the last non-whitespace char
let end = a.iter()
.rev()
.position(is_not_whitespace)
.map(|j| a.len() - j);
// build it
let vec: Vec<u8> = begin.and_then(|i| end.map(|j| a[i..j].to_vec()))
.unwrap_or_default();
println!("{:?}", vec);
}
parts.count() leads to ownership transfer, so parts can't be used any more.
fn split(slice: &[u8], splitter: &[u8]) -> Option<Vec<u8>> {
let mut parts = slice.split(|b| splitter.contains(b));
let len = parts.count(); //ownership transfer
if len >= 2 {
Some(parts.nth(1).unwrap().to_vec())
} else if len >= 1 {
Some(parts.nth(0).unwrap().to_vec())
} else {
None
}
}
fn main() {
split(&[1u8, 2u8, 3u8], &[2u8]);
}
It is also possible to avoid unnecessary allocations of Vec if you only need to use the first or the second part:
fn split<'a>(slice: &'a [u8], splitter: &[u8]) -> Option<&'a [u8]> {
let mut parts = slice.split(|b| splitter.contains(b)).fuse();
let first = parts.next();
let second = parts.next();
second.or(first)
}
Then if you actually need a Vec you can map on the result:
split(&[1u8, 2u8, 3u8], &[2u8]).map(|s| s.to_vec())
Of course, if you want, you can move to_vec() conversion to the function:
second.or(first).map(|s| s.to_vec())
I'm calling fuse() on the iterator in order to guarantee that it will always return None after the first None is returned (which is not guaranteed by the general iterator protocol).
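To make the behaviour of second.or(first) concrete, here is the same function once more under the hypothetical name second_or_first, followed by what it returns in the edge cases. One thing worth knowing is that slice::split yields a single empty part even for an empty input, so this function appears never to return None in practice.

```rust
/// Same logic as the slice-returning `split` above: the second part if
/// there is one, otherwise the first.
fn second_or_first<'a>(slice: &'a [u8], splitter: &[u8]) -> Option<&'a [u8]> {
    let mut parts = slice.split(|b| splitter.contains(b)).fuse();
    let first = parts.next();
    let second = parts.next();
    second.or(first)
}
```

For [1, 2, 3] split on 2 the result is Some([3]); with no separator present it is the whole input; and a separator in last position gives an empty second part.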
The other answers are good suggestions to answer your problem, but I'd like to point out another general solution: create multiple iterators:
fn split(slice: &[u8], splitter: &[u8]) -> Option<Vec<u8>> {
let mut parts = slice.split(|b| splitter.contains(b));
let parts2 = slice.split(|b| splitter.contains(b));
let len = parts2.count();
if len >= 2 {
Some(parts.nth(1).unwrap().to_vec())
} else if len >= 1 {
Some(parts.nth(0).unwrap().to_vec())
} else {
None
}
}
fn main() {
split(&[1u8, 2u8, 3u8], &[2u8]);
}
You can usually create multiple read-only iterators. Some iterators even implement Clone, so you could just say iter.clone().count(). When this answer was written, Split wasn't one of them because it owns the passed-in closure.
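On current toolchains the Clone route does seem to be open for this case: closures are Clone when everything they capture is, and slice::Split is Clone when its predicate is. A minimal sketch under that assumption, with the hypothetical name split_cloned:

```rust
/// Hypothetical variant of `split` that counts on a clone of the iterator.
fn split_cloned(slice: &[u8], splitter: &[u8]) -> Option<Vec<u8>> {
    let mut parts = slice.split(|b| splitter.contains(b));
    // `count()` consumes the clone; `parts` itself is still at the start.
    let len = parts.clone().count();
    let wanted = if len >= 2 { parts.nth(1) } else { parts.next() };
    wanted.map(|part| part.to_vec())
}
```

The closure here only captures a shared reference to splitter, which is why it can be cloned.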
One thing you can do is collect the results of the split in a new owned Vec, like this:
fn split(slice: &[u8], splitter: &[u8]) -> Option<Vec<u8>> {
let parts: Vec<&[u8]> = slice.split(|b| splitter.contains(b)).collect();
let len = parts.len();
if len >= 2 {
Some(parts.iter().nth(1).unwrap().to_vec())
} else if len >= 1 {
Some(parts.iter().nth(0).unwrap().to_vec())
} else {
None
}
}
Editor's note: This code example is from a version of Rust prior to 1.0 when many iterators implemented Copy. Updated versions of this code produce different errors, but the answers still contain valuable information.
I'm trying to write a function to split a string into clumps of letters and numbers; for example, "test123test" would turn into [ "test", "123", "test" ]. Here's my attempt so far:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek() {
None => return bits,
Some(c) => if c.is_digit() {
bits.push(iter.take_while(|c| c.is_digit()).collect());
} else {
bits.push(iter.take_while(|c| !c.is_digit()).collect());
}
}
}
return bits;
}
However, this doesn't work, looping forever. It seems that it is using a clone of iter each time I call take_while, starting from the same position over and over again. I would like it to use the same iter each time, advancing the same iterator across all the take_while calls. Is this possible?
As you identified, each take_while call is duplicating iter, since take_while takes self and the Peekable chars iterator is Copy. (Only true before Rust 1.0 — editor)
You want to be modifying the iterator each time, that is, for take_while to be operating on an &mut to your iterator. Which is exactly what the .by_ref adaptor is for:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek().map(|c| *c) {
None => return bits,
Some(c) => if c.is_digit(10) {
bits.push(iter.by_ref().take_while(|c| c.is_digit(10)).collect());
} else {
bits.push(iter.by_ref().take_while(|c| !c.is_digit(10)).collect());
},
}
}
}
fn main() {
println!("{:?}", split("123abc456def"))
}
Prints
["123", "bc", "56", "ef"]
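However, this is not correct: take_while has to consume the first element that fails the predicate in order to test it, so the first character of each following clump is lost.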
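The lost character is easy to see in isolation. The sketch below (digits_then_rest is a hypothetical helper written only for this demonstration) collects the leading digits through by_ref().take_while(..) and then returns whatever the iterator still holds.

```rust
/// Collects the leading digits of `input` with `by_ref().take_while(..)` and
/// returns them together with whatever the iterator still yields afterwards.
fn digits_then_rest(input: &str) -> (String, String) {
    let mut iter = input.chars();
    let digits: String = iter.by_ref().take_while(|c| c.is_ascii_digit()).collect();
    // `take_while` had to pull the first non-digit out of `iter` to test it,
    // and that char is not handed back.
    let rest: String = iter.collect();
    (digits, rest)
}
```

For the input "123abc" this returns ("123", "bc"): the 'a' was consumed by take_while and discarded.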
I would actually recommend writing this as a normal for loop, using the char_indices iterator:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(input[start..i].to_string());
is_digit = this_is_digit;
start = i;
}
}
bits.push(input[start..].to_string());
bits
}
This form also allows for doing this with far fewer allocations (that is, the Strings are not required), because each returned value is just a slice into the input, and we can use lifetimes to state this:
pub fn split<'a>(input: &'a str) -> Vec<&'a str> {
let mut bits = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(&input[start..i]);
is_digit = this_is_digit;
start = i;
}
}
bits.push(&input[start..]);
bits
}
All that changed was the type signature, removing the Vec<String> type hint and the .to_string calls.
One could even write an iterator like this, to avoid having to allocate the Vec. Something like fn split<'a>(input: &'a str) -> Splits<'a> { /* construct a Splits */ } where Splits is a struct that implements Iterator<Item = &'a str>.
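A minimal sketch of what such an iterator could look like follows. The Splits name comes from the paragraph above; the single rest field and the way each run is located are assumptions made for illustration.

```rust
/// Iterator over alternating runs of digits and non-digits in a string.
/// The `rest` field (the part of the input not yet returned) is an assumption.
pub struct Splits<'a> {
    rest: &'a str,
}

pub fn split<'a>(input: &'a str) -> Splits<'a> {
    Splits { rest: input }
}

impl<'a> Iterator for Splits<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let s = self.rest;
        // An empty remainder means the iterator is exhausted.
        let is_digit = s.chars().next()?.is_digit(10);
        // Byte index of the first char of the other class, or the end of `s`.
        let end = s
            .char_indices()
            .find(|&(_, c)| c.is_digit(10) != is_digit)
            .map_or(s.len(), |(i, _)| i);
        let (run, rest) = s.split_at(end);
        self.rest = rest;
        Some(run)
    }
}
```

Callers that do want a Vec can still write split("test123test").collect::<Vec<_>>(), while others can iterate lazily without any allocation.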
take_while takes self by value: it consumes the iterator. Before Rust 1.0 it also was unfortunately able to be implicitly copied, leading to the surprising behaviour that you are observing.
You cannot use take_while for what you are wanting for these reasons. You will need to manually unroll your take_while invocations.
Here is one of many possible ways of dealing with this:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
let seeking_digits = match iter.peek() {
None => return bits,
Some(c) => c.is_digit(10),
};
if seeking_digits {
bits.push(take_while(&mut iter, |c| c.is_digit(10)));
} else {
bits.push(take_while(&mut iter, |c| !c.is_digit(10)));
}
}
}
fn take_while<I, F>(iter: &mut std::iter::Peekable<I>, predicate: F) -> String
where
I: Iterator<Item = char>,
F: Fn(&char) -> bool,
{
let mut out = String::new();
loop {
match iter.peek() {
Some(c) if predicate(c) => out.push(*c),
_ => return out,
}
let _ = iter.next();
}
}
fn main() {
println!("{:?}", split("test123test"));
}
This yields a solution with two levels of looping; another valid approach would be to model it as a state machine one level deep only. Ask if you aren’t sure what I mean and I’ll demonstrate.
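One possible reading of the state-machine idea is sketched below: a single loop over the chars, where the state is whether the run currently being built consists of digits. The name split_state_machine is hypothetical, and this is not necessarily the form the answer's author had in mind.

```rust
/// Splits `input` into runs of digits and non-digits with a single loop.
/// The state is whether the run being built consists of digits.
pub fn split_state_machine(input: &str) -> Vec<String> {
    let mut bits = Vec::new();
    let mut current = String::new();
    let mut in_digits = false;
    for c in input.chars() {
        let is_digit = c.is_digit(10);
        // A change of class ends the current run, unless nothing has been
        // collected yet (which is only the case for the very first char).
        if is_digit != in_digits && !current.is_empty() {
            bits.push(std::mem::take(&mut current));
        }
        in_digits = is_digit;
        current.push(c);
    }
    if !current.is_empty() {
        bits.push(current);
    }
    bits
}
```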