Find next char boundary index in string after char - string

Given the string s, and the index i which is where the 好 character starts:
let s = "abc 好 def";
let i = 4;
What's the best way to get the index after that character, so that I can slice the string and get abc 好? In code:
let end = find_end(s, i);
assert_eq!("abc 好", &s[0..end]);
(Note, + 1 doesn't work because that assumes that the character is only 1 byte long.)
I currently have the following:
fn find_end(s: &str, i: usize) -> usize {
i + s[i..].chars().next().unwrap().len_utf8()
}
But I'm wondering if I'm missing something and there's a better way?

You could use char_indices to get the next index rather than using len_utf8 on the character, though that has a special case for the last character.
I would use the handy str::is_char_boundary() method. Here's an implementation using that:
fn find_end(s: &str, i: usize) -> usize {
assert!(i < s.len());
let mut end = i+1;
while !s.is_char_boundary(end) {
end += 1;
}
end
}
Normally I would make such a function return Option<usize> in case it's called with an index at the end of s, but for now I've just asserted.
In many cases, instead of explicitly calling find_end it may make sense to iterate using char_indices, which gives you each index along with the characters; though it's slightly annoying if you want to know the end of the current character.
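The char_indices alternative mentioned above could be sketched roughly as follows. This is only an illustration, not part of the original answer, and the helper name char_spans is hypothetical. It pairs every character with its start and end byte offsets, computing the end as the start plus len_utf8(), which avoids the special case for the last character.

```rust
/// Hypothetical helper: returns (start, end, char) for every character,
/// where start..end is the character's byte range within `s`.
fn char_spans(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .map(|(start, ch)| (start, start + ch.len_utf8(), ch))
        .collect()
}
```

For the string in the question, the entry for 好 would be (4, 7, '好'), so &s[0..7] yields "abc 好".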

To serve as a complement to @ChrisEmerson's answer, this is how one could implement a find_end that searches for the end of a character's first occurrence.
fn find_end<'s>(s: &'s str, p: char) -> Option<usize> {
let mut indices = s.char_indices();
let mut found = false;
for (_, v) in &mut indices {
if v == p {
found = true;
break;
}
}
if found {
Some(indices.next()
.map_or_else(|| s.len(), |(i, _)| i))
} else {
None
}
}
Although it avoids the byte boundary loop, it is still not very elegant. Ideally, an iterator method for traversing until a predicate is met would simplify this.
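One candidate for such a method is Iterator::find, which already exists in the standard library. The following is a sketch under that assumption, not the answer author's code; the name find_end_of_first is hypothetical. It finds the first matching (index, char) pair and adds the character's UTF-8 length to get the index just past it.

```rust
/// Hypothetical alternative: byte index just past the first occurrence of `p`,
/// or None if `p` does not occur in `s`.
fn find_end_of_first(s: &str, p: char) -> Option<usize> {
    s.char_indices()
        .find(|&(_, c)| c == p)
        .map(|(start, c)| start + c.len_utf8())
}
```

Because the end is computed from len_utf8() rather than from the next index, a match on the last character needs no fallback to s.len().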

Related

Detect duplicated elements of a string slice happening in order

I need to detect and list the characters of a string slice that repeat consecutively N or more times. I already have a solution in Rust that doesn't use higher-order functions (below), but I wonder whether it can be simplified by chaining iterator methods.
The idea:
let v = "1122253225";
let n = 2;
Output:
There are 2 repetition of '1'
There are 3 repetition of '2'
There are 2 repetition of '2'
The indexes where a repetition happens are not important. Repetitions must be consecutive (i.e. the 3 repetitions of '2' are separated by other values from the later 2 repetitions of '2', so they count as separate output lines).
My non-iterator solution:
let mut cur_ch = '\0';
let mut repeat = 0;
for ch in v.chars() {
    if ch == cur_ch {
        repeat += 1;
    } else {
        if repeat >= n {
            println!("There are {} repetition of '{}'", repeat, cur_ch);
        }
        cur_ch = ch;
        repeat = 1;
    }
}
// Report the final run, which the loop above never gets to flush.
if repeat >= n {
    println!("There are {} repetition of '{}'", repeat, cur_ch);
}
It works, but is there a better way to do so with chaining iter methods?
Here is a solution that uses scan and filter_map:
fn main() {
let s = "112225322555";
let n = 2;
let i = s
.chars()
.map(|v| Some(v))
.chain(std::iter::once(None))
.scan((0, None), |(count, ch), v| match ch {
Some(c) if *c == v => {
*count += 1;
Some((None, *count))
}
_ => Some((ch.replace(v), std::mem::replace(count, 1))),
})
.filter_map(|(ch, count)| match ch {
Some(Some(ch)) if count >= n => Some((ch, count)),
_ => None,
});
for (ch, num) in i {
println!("There are {} repetitions of {}", num, ch);
}
}
The first step is to use scan to count the number of adjacent characters. The first argument to scan is a state variable, which gets passed to each call of the closure that you pass as the second argument. In this case the state variable is a tuple containing the number of times the current character has been seen, followed by the current character itself.
Note:
We need to extend the iteration one beyond the end of the string we are analyzing (otherwise we would miss the case where the end of the string contained a run of characters meeting the criteria). We do this by mapping the iteration into Option<char> and then chaining on a single None. This is better than special-casing a character such as \0, so that we could even count \0 characters.
For the same reason, we use Option<char> as the current character within the state tuple.
The return value of scan is an iterator over (Option<Option<char>>, i32). The first value in the tuple will be None for each repeated character in the iterator, whereas at each boundary where the character changes it will be Some(Some(char))
We use replace to both return the current character and count, at the same time as setting the state tuple to new values
The last step is to use filter_map to:
remove the (None, i32) variants, which indicate no change in the incoming character
filter out the cases where the count does not reach the limit n.
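To make the intermediate values easier to picture, here is a small sketch that runs only the scan step and collects what it yields. The helper name scan_trace is hypothetical, and the state type is spelled out explicitly as an assumption to help inference; the closure body is the same as in the answer.

```rust
/// Hypothetical helper: collects the raw items produced by the scan step
/// (before filter_map) so they can be inspected.
fn scan_trace(s: &str) -> Vec<(Option<Option<char>>, i32)> {
    s.chars()
        .map(Some)
        // One trailing None flushes the final run.
        .chain(std::iter::once(None))
        // State: (count of the current run, current character).
        .scan((0i32, None::<Option<char>>), |(count, ch), v| match ch {
            Some(c) if *c == v => {
                *count += 1;
                Some((None, *count))
            }
            _ => Some((ch.replace(v), std::mem::replace(count, 1))),
        })
        .collect()
}
```

For the input "aab" this yields (None, 0), (None, 2), (Some(Some('a')), 2), (Some(Some('b')), 1). Only the last two items carry a finished run, and with n = 2 the filter_map step would keep just ('a', 2).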
Here's one attempt at using filter_map():
fn foo(v: &str, n: usize) -> impl Iterator<Item = (usize, char)> + '_ {
let mut cur_ch = '\0';
let mut repeat = 0;
v.chars().chain(std::iter::once('\0')).filter_map(move |ch| {
if ch == cur_ch {
repeat += 1;
return None;
}
let val = if repeat >= n {
Some((repeat, cur_ch))
} else {
None
};
cur_ch = ch;
repeat = 1;
val
})
}
fn main() {
for (repeat, ch) in foo("1122253225", 2) {
println!("There are {} repetition of '{}'", repeat, ch);
}
}
And then you can generalize this to something like this:
fn foo<'i, I, T>(v: I, n: usize) -> impl Iterator<Item = (usize, T)> + 'i
where
I: Iterator<Item = T> + 'i,
T: Clone + Default + PartialEq + 'i,
{
let mut cur = T::default();
let mut repeat = 0;
v.chain(std::iter::once(T::default()))
.filter_map(move |i| {
if i == cur {
repeat += 1;
return None;
}
let val = if repeat >= n {
Some((repeat, cur.clone()))
} else {
None
};
cur = i;
repeat = 1;
val
})
}
This would be higher-order, but not sure if it's actually much simpler than just using a for loop!

How do I get the index from the beginning in a reversed iterator of chars of a string?

This code:
let s = String::from("hi");
for (idx, ch) in s.chars().rev().enumerate() {
println!("{} {}", idx, ch);
}
prints
0 i
1 h
but I want to know the real index, so that it would print:
1 i
0 h
What's the best way to do that? Currently I can only think of first getting .count() and subtracting each idx from it, but maybe there's a better method that I overlooked.
This is complicated, as they say. If your string is ASCII only, you can enumerate the String's byte iterator and then reverse it:
fn main() {
let s = String::from("hi");
for (idx, ch) in s.bytes().enumerate().rev() {
println!("{} {}", idx, ch as char);
}
}
This doesn't work for Unicode strings in general because of what a char in Rust stands for:
The char type represents a single character. More specifically, since 'character' isn't a well-defined concept in Unicode, char is a 'Unicode scalar value', which is similar to, but not the same as, a 'Unicode code point'.
This can be illustrated by the following:
fn main() {
let s = String::from("y̆");
println!("{}", s.len());
for (idx, ch) in s.bytes().enumerate() {
println!("{} {}", idx, ch);
}
for (idx, ch) in s.chars().enumerate() {
println!("{} {}", idx, ch);
}
}
This odd-looking string has a length of 3, as in 3 u8s, yet it contains only 2 chars. Because a char occupies a variable number of bytes, ExactSizeIterator can't be trivially implemented for std::str::Chars, but it can be, and is, implemented for std::str::Bytes. This matters because reversing an iterator requires it to be a DoubleEndedIterator:
fn rev(self) -> Rev<Self>
where
Self: DoubleEndedIterator,
But Enumerate implements DoubleEndedIterator only if the underlying iterator is also an ExactSizeIterator:
impl<I> DoubleEndedIterator for Enumerate<I>
where
I: ExactSizeIterator + DoubleEndedIterator,
In conclusion, you can only do s.bytes().enumerate().rev(), but not s.chars().enumerate().rev(). If you absolutely have to index the enumerated char iterator of a String that way, you are on your own.
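Two workarounds that may help in that situation are sketched below. Neither comes from the answer above, and both helper names are hypothetical. The first relies on char_indices(), whose iterator is double-ended and yields byte offsets; the second collects the chars into a Vec first, so that the length is known and enumerate().rev() becomes available, at the cost of an allocation.

```rust
/// Byte offset of each character, walking from the end of the string.
fn rev_byte_indices(s: &str) -> Vec<(usize, char)> {
    // CharIndices is double-ended, so rev() works without ExactSizeIterator.
    s.char_indices().rev().collect()
}

/// Character (not byte) index of each character, walking from the end.
fn rev_char_indices(s: &str) -> Vec<(usize, char)> {
    // Collecting first gives an exact length, which enumerate().rev() needs.
    let chars: Vec<char> = s.chars().collect();
    chars.into_iter().enumerate().rev().collect()
}
```

For "hi" both return (1, 'i') followed by (0, 'h'). They differ as soon as a multi-byte character appears before the end of the string.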

How to find the starting offset of a string slice of another string? [duplicate]

Given a string and a slice referring to some substring, is it possible to find the starting and ending index of the slice?
I have a ParseString function which takes in a reference to a string, and tries to parse it according to some grammar:
ParseString(inp_string: &str) -> Result<(), &str>
If the parsing is fine, the result is just Ok(()), but if there's some error, it usually is in some substring, and the error instance is Err(e), where e is a slice of that substring.
When given the substring where the error occurs, I want to say something like "Error from characters x to y", where x and y are the starting and ending indices of the erroneous substring.
I don't want to encode the position of the errors directly in Err, because I'm nesting these invocations, and the offsets in a nested slice might not correspond to the offsets of the same slice in the top-level string.
As long as all of your string slices borrow from the same string buffer, you can calculate offsets with simple pointer arithmetic. You need the following methods:
str::as_ptr(): Returns the pointer to the start of the string slice
A way to get the difference between two pointers. Right now, the easiest way is to just cast both pointers to usize (which is always a no-op) and then subtract those. On 1.47.0+, there is a method offset_from() which is slightly nicer.
Here is working code (Playground):
fn get_range(whole_buffer: &str, part: &str) -> (usize, usize) {
let start = part.as_ptr() as usize - whole_buffer.as_ptr() as usize;
let end = start + part.len();
(start, end)
}
fn main() {
let input = "Everyone ♥ Ümläuts!";
let part1 = &input[1..7];
println!("'{}' has offset {:?}", part1, get_range(input, part1));
let part2 = &input[7..16];
println!("'{}' has offset {:?}", part2, get_range(input, part2));
}
Rust actually used to have an unstable method for doing exactly this, but it was removed due to being obsolete, which was a bit odd considering the replacement didn't remotely have the same functionality.
That said, the implementation isn't that big, so you can just add the following to your code somewhere:
pub trait SubsliceOffset {
/**
Returns the byte offset of an inner slice relative to an enclosing outer slice.
Examples
```ignore
let string = "a\nb\nc";
let lines: Vec<&str> = string.lines().collect();
assert!(string.subslice_offset_stable(lines[0]) == Some(0)); // &"a"
assert!(string.subslice_offset_stable(lines[1]) == Some(2)); // &"b"
assert!(string.subslice_offset_stable(lines[2]) == Some(4)); // &"c"
assert!(string.subslice_offset_stable("other!") == None);
```
*/
fn subslice_offset_stable(&self, inner: &Self) -> Option<usize>;
}
impl SubsliceOffset for str {
fn subslice_offset_stable(&self, inner: &str) -> Option<usize> {
let self_beg = self.as_ptr() as usize;
let inner = inner.as_ptr() as usize;
if inner < self_beg || inner > self_beg.wrapping_add(self.len()) {
None
} else {
Some(inner.wrapping_sub(self_beg))
}
}
}
You can remove the _stable suffix if you don't need to support old versions of Rust; it's just there to avoid a name conflict with the now-removed subslice_offset method.
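Tying this back to the question, the pointer arithmetic could be wrapped into a checked helper that also produces the error message. This is a sketch only; byte_range and describe_error are hypothetical names, and the positions it reports are byte offsets, not character counts.

```rust
/// Hypothetical helper: byte range of `part` within `whole`, or None if
/// `part` does not lie entirely inside `whole`'s buffer.
fn byte_range(whole: &str, part: &str) -> Option<(usize, usize)> {
    let whole_start = whole.as_ptr() as usize;
    let part_start = part.as_ptr() as usize;
    // checked_sub rejects a `part` that starts before `whole`.
    let start = part_start.checked_sub(whole_start)?;
    let end = start.checked_add(part.len())?;
    if end <= whole.len() {
        Some((start, end))
    } else {
        None
    }
}

/// Hypothetical helper: builds a message like the one the question asks for.
fn describe_error(whole: &str, part: &str) -> String {
    match byte_range(whole, part) {
        Some((start, end)) => format!("Error from bytes {} to {}", start, end),
        None => String::from("Error at unknown position"),
    }
}
```

Returning an Option keeps the caller from silently computing a nonsensical offset when the error slice happens to borrow from a different buffer.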

How to find the last occurrence of a char in a string?

I want to find the index of the last forward slash / in a string. For example, I have the string /test1/test2/test3 and I want to find the location of the slash before test3. How can I achieve this?
In Python, I would use rfind but I can't find anything like that in Rust.
You need to use str::rfind. Note that it returns an Option<usize>, so you will need to account for that when checking its result:
fn main() {
let s = "/test1/test2/test3";
let pos = s.rfind('/');
println!("{:?}", pos); // prints "Some(12)"
}
@ljedrz's solution returns a byte offset, which is not the same as the character index if the string contains non-ASCII characters before the match.
Here is a slower solution that always returns the character index:
let s = "/test1/test2/test3";
let pos = s.chars().count() - s.chars().rev().position(|c| c == '/').unwrap() - 1;
Or you can use this as a function:
fn rfind_utf8(s: &str, chr: char) -> Option<usize> {
if let Some(rev_pos) = s.chars().rev().position(|c| c == chr) {
Some(s.chars().count() - rev_pos - 1)
} else {
None
}
}
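Which index is wanted depends on what it is used for. The sketch below (hypothetical helper names, not from either answer) illustrates the difference: the byte offset from rfind is what string slicing expects, while a character index has to be derived by counting the chars before that offset.

```rust
/// Hypothetical helper: the part of `s` after the last '/', if any.
fn after_last_slash(s: &str) -> Option<&str> {
    // rfind returns a byte offset, which is exactly what slicing expects.
    s.rfind('/').map(|i| &s[i + '/'.len_utf8()..])
}

/// Hypothetical helper: the character index of the last '/', if any.
fn last_slash_char_index(s: &str) -> Option<usize> {
    // Count the characters that precede the byte offset found by rfind.
    s.rfind('/').map(|i| s[..i].chars().count())
}
```

Deriving the character index this way needs a single forward pass over the prefix, in contrast to the two passes made by chars().count() plus chars().rev().position().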

How to test if a string contains each character in a pattern in order?

I'm trying to port this Python function that returns true if each character in the pattern appears in the test string in order.
def substr_match(pattern, document):
    p_idx, d_idx, p_len, d_len = 0, 0, len(pattern), len(document)
    while (p_idx != p_len) and (d_idx != d_len):
        if pattern[p_idx].lower() == document[d_idx].lower():
            p_idx += 1
        d_idx += 1
    return p_len != 0 and d_len != 0 and p_idx == p_len
This is what I have at the moment.
fn substr_match(pattern: &str, document: &str) -> bool {
let mut pattern_idx = 0;
let mut document_idx = 0;
let pattern_len = pattern.len();
let document_len = document.len();
while (pattern_idx != pattern_len) && (document_idx != document_len) {
let pat: Vec<_> = pattern.chars().nth(pattern_idx).unwrap().to_lowercase().collect();
let doc: Vec<_> = document.chars().nth(document_idx).unwrap().to_lowercase().collect();
if pat == doc {
pattern_idx += 1;
}
document_idx += 1;
}
return pattern_len != 0 && document_len != 0 && pattern_idx == pattern_len;
}
I tried s.chars().nth(n) since Rust doesn't seem to allow string indexing, but I feel there is a more idiomatic way of doing it. What would be the preferred way of writing this in Rust?
Here is mine:
fn substr_match(pattern: &str, document: &str) -> bool {
let pattern_chars = pattern.chars().flat_map(char::to_lowercase);
let mut doc_chars = document.chars().flat_map(char::to_lowercase);
'outer: for p in pattern_chars {
for d in &mut doc_chars {
if d == p {
continue 'outer;
}
}
return false;
}
true
}
The other answers mimic the behavior of the Python function you started with, but it may be worth trying to make it better. I thought of two test cases where the original function may have surprising behavior:
>>> substr_match("ñ", "in São Paulo")
True
>>> substr_match("🇺🇸", "🇺🇦🇸🇰")
True
Hmm.
(The first example may depend on your input method; try copying and pasting. Also, if you can't see them, the special characters in the second example are flag emoji for the United States, Ukraine, and Slovakia.)
Without getting into why these tests fail or all the other things that could potentially be undesired, if you want to correctly handle Unicode text, you need to, at minimum, operate on graphemes instead of code points (this question describes the difference). Rust doesn't provide this feature in the standard library, so you need the unicode-segmentation crate, which provides a graphemes method on str.
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
fn substr_match(pattern: &str, document: &str) -> bool {
let mut haystack = document.graphemes(true);
pattern.len() > 0 && pattern.graphemes(true).all(|needle| {
haystack
.find(|grapheme| {
grapheme
.chars()
.flat_map(char::to_lowercase)
.eq(needle.chars().flat_map(char::to_lowercase))
})
.is_some()
})
}
Playground, test cases provided.
This algorithm takes advantage of several convenience methods on Iterator. all iterates over the pattern. find short-circuits, so whenever it finds the next needle in haystack, the next call to haystack.find will start at the following element.
(I thought this approach was somewhat clever, but honestly, a nested for loop is probably easier to read, so you might prefer that.)
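The same short-circuiting trick can be tried without any external crate. The sketch below is a hypothetical std-only variant (the name subsequence_match is made up) that works on chars rather than graphemes, so it does not fix the Unicode issues discussed above; it only illustrates how all and a consuming search share one iterator. It uses any where the code above uses find(...).is_some().

```rust
/// Hypothetical std-only variant: matches on chars rather than graphemes.
fn subsequence_match(pattern: &str, document: &str) -> bool {
    let mut haystack = document.chars().flat_map(char::to_lowercase);
    !pattern.is_empty()
        && pattern
            .chars()
            .flat_map(char::to_lowercase)
            // any() consumes `haystack` up to and including the first match,
            // so the next search resumes right after it.
            .all(|needle| haystack.any(|c| c == needle))
}
```

With this sketch, subsequence_match("abc", "xaxbxc") is true, while subsequence_match("acb", "abc") is false because the search for 'b' starts after the 'c' has already been consumed.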
The last "tricky" bit is case-insensitive string comparison, which is inherently language-dependent, but if you're willing to accept only unconditional mappings (those that apply in any language), char::to_lowercase does the trick. Rather than collect the result into a String, though, you can use Iterator::eq to compare the sequences of (lowercased) characters.
One other thing you may want to consider is Unicode normalization -- this question is a good place for the broad strokes. Fortunately, Rust has a unicode-normalization crate, too! And it looks quite easy to use. (You wouldn't necessarily want to use it in this function, though; instead, you might normalize all text on input so that you're dealing with the same normalization form everywhere in your program.)
str::chars() returns an iterator. Iterators return elements from a sequence one at a time. Specifically, str::chars() returns characters from a string one at a time. It's much more efficient to use a single iterator to iterate over a string than to create a new iterator each time you want to look up a character, because s.chars().nth(n) needs to perform a linear scan in order to find the nth character in the UTF-8 encoded string.
fn substr_match(pattern: &str, document: &str) -> bool {
let mut pattern_iter = pattern.chars();
let mut pattern_ch_lower: String = match pattern_iter.next() {
Some(ch) => ch,
None => return false,
}.to_lowercase().collect();
for document_ch in document.chars() {
let document_ch_lower: String = document_ch.to_lowercase().collect();
if pattern_ch_lower == document_ch_lower {
pattern_ch_lower = match pattern_iter.next() {
Some(ch) => ch,
None => return true,
}.to_lowercase().collect();
}
}
return false;
}
Here, I'm demonstrating two ways of using iterators:
To iterate over the pattern, I'm using the next method manually. next returns an Option: Some(value) if the iterator hasn't finished, or None if it has.
To iterate over the document, I'm using a for loop. The for loop does the work of calling next and unwrapping the result until next returns None.
One thing to notice is that I'm using a return expression inside a match expression (twice). Since a return expression never produces a value (its type is the never type, !), the compiler lets it stand in for a value of any type. In this case, on the Some arm, the result is a char, so the whole match evaluates to a char.
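A reduced example of that pattern, using a hypothetical helper named first_lower, might look like this:

```rust
/// Hypothetical helper: lowercases the first character of `s`, if there is one.
fn first_lower(s: &str) -> Option<String> {
    // The `None` arm returns early, so it never has to produce a `char`;
    // the match as a whole still has type `char`.
    let first: char = match s.chars().next() {
        Some(ch) => ch,
        None => return None,
    };
    // to_lowercase() yields an iterator because one char may map to several.
    Some(first.to_lowercase().collect())
}
```

Calling first_lower("Rust") gives Some("r"), and first_lower("") gives None.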
We could also do this with two nested for loops:
fn substr_match(pattern: &str, document: &str) -> bool {
    if pattern.is_empty() {
        return false;
    }
    let mut document_iter = document.chars();
    'pattern: for pattern_ch in pattern.chars() {
        let pattern_ch_lower: String = pattern_ch.to_lowercase().collect();
        for document_ch in &mut document_iter {
            let document_ch_lower: String = document_ch.to_lowercase().collect();
            if pattern_ch_lower == document_ch_lower {
                // Matched: move on to the next pattern character.
                continue 'pattern;
            }
        }
        // The document ran out before this pattern character was found.
        return false;
    }
    true
}
There are two things to notice here:
We need to handle the case where the pattern is empty without using the iterator.
In the inner loop, we don't want to restart from the start of the document when we move to the next pattern character, so we need to reuse the same iterator over the document. When we write for x in iter, the for loop takes ownership of iter; to avoid that, we must write &mut iter instead. Mutable references to iterators are iterators themselves, thanks to the blanket implementation impl<'a, I> Iterator for &'a mut I where I: Iterator + ?Sized in the standard library.
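The effect of iterating over &mut iter can be seen in isolation in the following sketch. The helper split_at_first_space is hypothetical and exists only to show that the iterator remains usable, and keeps its position, after the for loop ends.

```rust
/// Hypothetical helper: splits `s` at the first space, reusing one iterator.
fn split_at_first_space(s: &str) -> (String, String) {
    let mut chars = s.chars();
    let mut head = String::new();
    // Borrowing with `&mut` lets the loop advance `chars` without consuming it.
    for c in &mut chars {
        if c == ' ' {
            break;
        }
        head.push(c);
    }
    // `chars` is still usable here and continues right after the space.
    let tail: String = chars.collect();
    (head, tail)
}
```

Writing for c in chars instead would move the iterator into the loop, and the later chars.collect() would no longer compile.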
