How to find the last occurrence of a char in a string?

I want to find the index of the last forward slash / in a string. For example, I have the string /test1/test2/test3 and I want to find the location of the slash before test3. How can I achieve this?
In Python, I would use rfind but I can't find anything like that in Rust.

You need to use str::rfind. Note that it returns an Option<usize> (the byte index of the match, or None if there is no match), so you will need to account for that when checking its result:
fn main() {
let s = "/test1/test2/test3";
let pos = s.rfind('/');
println!("{:?}", pos); // prints "Some(12)"
}
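As a rough sketch of how the Option might be handled in practice, the following splits the path at its last slash. The helper name split_at_last_slash is made up for illustration and is not part of the standard library; it relies on rfind returning a byte offset, which is what string slicing expects.

```rust
/// Splits `s` around its last '/'; returns `None` when there is no slash.
/// (Hypothetical helper, named here only for illustration.)
fn split_at_last_slash(s: &str) -> Option<(&str, &str)> {
    // `rfind` yields the byte offset of the match, so it can be used
    // directly as a slice bound. '/' is exactly one byte long in UTF-8.
    let pos = s.rfind('/')?;
    Some((&s[..pos], &s[pos + 1..]))
}

fn main() {
    match split_at_last_slash("/test1/test2/test3") {
        Some((head, tail)) => println!("{} | {}", head, tail),
        None => println!("no slash found"),
    }
}
```

If splitting is all you need, str::rsplit_once in the standard library covers the same case.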

@ljedrz's solution returns a byte index. That is what you need for slicing, but it is not the character index if your string contains non-ASCII characters.
Here is a slower solution that always gives you the character index:
let s = "/test1/test2/test3";
let pos = s.chars().count() - s.chars().rev().position(|c| c == '/').unwrap() - 1;
Or you can use this as a function:
fn rfind_utf8(s: &str, chr: char) -> Option<usize> {
if let Some(rev_pos) = s.chars().rev().position(|c| c == chr) {
Some(s.chars().count() - rev_pos - 1)
} else {
None
}
}
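To make the difference concrete, here is a small sketch comparing the two results on a string with multi-byte characters. The sample string is my own; rfind_utf8 is the function from above, written with map instead of if let.

```rust
/// Character index (not byte offset) of the last occurrence of `chr`.
fn rfind_utf8(s: &str, chr: char) -> Option<usize> {
    s.chars()
        .rev()
        .position(|c| c == chr)
        .map(|rev_pos| s.chars().count() - rev_pos - 1)
}

fn main() {
    let s = "/tést/好/x";
    // 'é' takes 2 bytes and '好' takes 3 bytes, so the two indices differ.
    let byte_pos = s.rfind('/').unwrap();
    let char_pos = rfind_utf8(s, '/').unwrap();
    println!("byte offset: {}, char index: {}", byte_pos, char_pos);
    // Only the byte offset is valid as a slice bound.
    println!("{}", &s[byte_pos..]);
}
```

Which of the two you want depends on what you do with the result: slicing needs the byte offset, counting characters needs the char index.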

Related

How do I remove some chars at the end of a string?

I need to match a few words at the start of a string, handle them, then remove them. How should I remove a few chars or bytes at the end of a String?
I am using the regex crate to match the string. I can't find a way to remove chars at the end of the String.
Maybe something like this, but the string may contain non-ASCII chars:
use lazy_static::lazy_static;
use regex::Regex;
fn func(s: &mut String) {
lazy_static! {
static ref RE: Regex = Regex::new(r"123").unwrap();
}
let cap = match RE.captures(s.as_str()) {
Some(v) => v.get(0).unwrap(),
None => panic!("Error"),
};
do_something(cap.as_str());
s.delete(0, cap.end());
}
fn do_something(s: &str) {
assert_eq!(s, "123")
}
fn main() {
let s = String::from("123456");
func(s);
assert_eq!(s, "456");
}
I have seen the remove method, but the docs say it's O(n). If it is, I think O(nm) is a little bit too slow for me.
You can use the regex crate's Match::start to get the start of the match.
You can then use truncate to get rid of everything after that.
fn main() {
let mut text: String = "this is a text with some garbage after!abc".into();
let re = regex::Regex::new("abc$").unwrap();
let m = re.captures(&text).unwrap();
let g = m.get(0).unwrap();
text.truncate(g.start());
dbg!(text);
}
What you're looking for is truncate - except with non-ascii support.
For ascii only, this works:
let mut s = String::from("123456789");
s.truncate(s.len() - 3);
assert_eq!(s, "123456");
However, since a String can contain Unicode characters that aren't always 1 byte long, it doesn't work for non-ASCII text (truncate panics if the new length does not lie on a char boundary).
If you want non-ASCII support, there isn't an O(1) solution according to this answer. That answer does give an implementation using char_indices(); I think it's the best way unless I'm missing something.
There is also the unicode-truncate crate, which also seems to use char_indices(); it might be worth a look.
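For what it's worth, a minimal sketch of the char_indices() idea could look like the following. This is not the implementation from the linked answer or from the unicode-truncate crate; the function name truncate_chars_from_end is hypothetical, and "char" here means a Unicode scalar value, not a grapheme.

```rust
/// Removes the last `n` chars (Unicode scalar values) from `s`.
/// Hypothetical helper; clears the string if it has fewer than `n` chars.
fn truncate_chars_from_end(s: &mut String, n: usize) {
    if n == 0 {
        return;
    }
    // Walk backwards over (byte offset, char) pairs. The n-th char from
    // the end starts at the byte offset that becomes the new length.
    let new_len = s
        .char_indices()
        .rev()
        .nth(n - 1)
        .map_or(0, |(idx, _)| idx);
    // `new_len` is always a char boundary, so `truncate` cannot panic here.
    s.truncate(new_len);
}

fn main() {
    let mut s = String::from("123好好");
    truncate_chars_from_end(&mut s, 2);
    println!("{}", s);
}
```

The cost is proportional to the number of chars removed, not to the length of the whole string, because the scan starts from the end.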

Difference between double quotes and single quotes in Rust

I was doing Advent of Code 2020 day 3 in Rust to train a little, because I am new to Rust, and my code would or would not compile depending on whether I used single quotes or double quotes on my "tree" variable.
The first code snippet would not compile and threw the error: expected u8, found &[u8; 1]
use std::fs;
fn main() {
let text: String = fs::read_to_string("./data/text").unwrap();
let vec: Vec<&str> = text.lines().collect();
let vec_vertical_len = vec.len();
let vec_horizontal_len = vec[0].len();
let mut i_pointer: usize = 0;
let mut j_pointer: usize = 0;
let mut tree_counter: usize = 0;
let tree = b"#";
loop {
i_pointer += 3;
j_pointer += 1;
if j_pointer >= vec_vertical_len {
break;
}
let i_index = i_pointer % vec_horizontal_len;
let character = vec[j_pointer].as_bytes()[i_index];
if character == tree {
tree_counter += 1
}
}
println!("{}", tree_counter);
}
The second snippet compiles and gives the right answer:
use std::fs;
fn main() {
let text: String = fs::read_to_string("./data/text").unwrap();
let vec: Vec<&str> = text.lines().collect();
let vec_vertical_len = vec.len();
let vec_horizontal_len = vec[0].len();
let mut i_pointer: usize = 0;
let mut j_pointer: usize = 0;
let mut tree_counter: usize = 0;
let tree = b'#';
loop {
i_pointer += 3;
j_pointer += 1;
if j_pointer >= vec_vertical_len {
break;
}
let i_index = i_pointer % vec_horizontal_len;
let character = vec[j_pointer].as_bytes()[i_index];
if character == tree {
tree_counter += 1
}
}
println!("{}", tree_counter);
}
I did not find any reference explaining what is going on when using single or double quotes. Can someone help me?
The short answer is that it works similarly to Java: single quotes for characters and double quotes for strings.
let a: char = 'k';
let b: &'static str = "k";
The b'' or b"" prefix means: take what I have here and interpret it as bytes instead. b'' gives a single byte, b"" gives a byte string.
let a: u8 = b'k';
let b: &'static [u8; 1] = b"k";
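Applied to the question, the comparison fails because a u8 is being compared with a &[u8; 1]. A small sketch of both working variants follows; the function names are made up for illustration.

```rust
/// Compares against a byte literal, which has type `u8`.
fn is_tree(byte: u8) -> bool {
    byte == b'#'
}

/// Compares against a byte string literal, which has type `&[u8; 1]`,
/// by indexing into it to get a `u8` first.
fn is_tree_via_byte_string(byte: u8) -> bool {
    let tree: &[u8; 1] = b"#";
    byte == tree[0]
}

fn main() {
    let row = "..#.";
    let character: u8 = row.as_bytes()[2];
    println!("{} {}", is_tree(character), is_tree_via_byte_string(character));
}
```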
The reason string literals result in references is due to how they are stored in the compiled binary. It would be too inefficient to store a string constant inside each method, so string literals are placed in a read-only data section of the binary. When your program is being executed, you are taking a reference to the bytes in that section (hence the static lifetime).
Going further down the rabbit hole, single quotes technically hold a codepoint. This is essentially what you might think of as a character. So a Unicode character would also be considered a single codepoint even though it may be multiple bytes long. A codepoint is assumed to fit into a u32 or less so you can safely convert any char by using as u32, but not the other way around since not all u32 values will match valid codepoints. This also means b'\u{x}' is not valid since \u{x} may produce characters that will not fit within a single byte.
// U+1F600 is a unicode smiley face
let a: char = '\u{1F600}';
assert_eq!(a as u32, 0x1F600);
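Going the other way, from u32 to char, has to be a checked conversion. A short sketch using std::char::from_u32, which returns None for values that are not valid chars (the wrapper name to_char is only for illustration):

```rust
/// Converts a raw number to a `char`, if it is a valid Unicode scalar value.
fn to_char(code: u32) -> Option<char> {
    std::char::from_u32(code)
}

fn main() {
    // A valid code point converts fine.
    println!("{:?}", to_char(0x1F600));
    // 0xD800 is a surrogate, which is not a valid `char`.
    println!("{:?}", to_char(0xD800));
    // 0x110000 is beyond the Unicode range.
    println!("{:?}", to_char(0x110000));
}
```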
However, you might find it interesting to know that since Rust strings are stored as UTF-8, codepoints over 127 will occupy multiple bytes in a string, even those from 128 to 255 that would fit into a single byte on their own. As you may already know, UTF-8 is simply a way of converting codepoints to bytes and back again.
let foo: &'static str = "\u{1F600}";
let foo_chars: Vec<char> = foo.chars().collect();
let foo_bytes: Vec<u8> = foo.bytes().collect();
assert_eq!(foo_chars.len(), 1);
assert_eq!(foo_bytes.len(), 4);
assert_eq!(foo_chars[0] as u32, 0x1F600);
assert_eq!(foo_bytes, vec![240, 159, 152, 128]);

How to test if a string contains each character in a pattern in order?

I'm trying to port this Python function that returns true if each character in the pattern appears in the test string in order.
def substr_match(pattern, document):
p_idx, d_idx, p_len, d_len = 0, 0, len(pattern), len(document)
while (p_idx != p_len) and (d_idx != d_len):
if pattern[p_idx].lower() == document[d_idx].lower():
p_idx += 1
d_idx += 1
return p_len != 0 and d_len != 0 and p_idx == p_len
This is what I have at the moment.
fn substr_match(pattern: &str, document: &str) -> bool {
let mut pattern_idx = 0;
let mut document_idx = 0;
let pattern_len = pattern.len();
let document_len = document.len();
while (pattern_idx != pattern_len) && (document_idx != document_len) {
let pat: Vec<_> = pattern.chars().nth(pattern_idx).unwrap().to_lowercase().collect();
let doc: Vec<_> = document.chars().nth(document_idx).unwrap().to_lowercase().collect();
if pat == doc {
pattern_idx += 1;
}
document_idx += 1;
}
return pattern_len != 0 && document_len != 0 && pattern_idx == pattern_len;
}
I tried s.chars().nth(n) since Rust doesn't seem to allow string indexing, but I feel there is a more idiomatic way of doing it. What would be the preferred way of writing this in Rust?
Here is mine:
fn substr_match(pattern: &str, document: &str) -> bool {
let pattern_chars = pattern.chars().flat_map(char::to_lowercase);
let mut doc_chars = document.chars().flat_map(char::to_lowercase);
'outer: for p in pattern_chars {
for d in &mut doc_chars {
if d == p {
continue 'outer;
}
}
return false;
}
true
}
The other answers mimic the behavior of the Python function you started with, but it may be worth trying to make it better. I thought of two test cases where the original function may have surprising behavior:
>>> substr_match("ñ", "in São Paulo")
True
>>> substr_match("🇺🇸", "🇺🇦🇸🇰")
True
Hmm.
(The first example may depend on your input method; try copying and pasting. Also, if you can't see them, the special characters in the second example are flag emoji for the United States, Ukraine, and Slovakia.)
Without getting into why these tests fail or all the other things that could potentially be undesired, if you want to correctly handle Unicode text, you need to, at minimum, operate on graphemes instead of code points (this question describes the difference). Rust doesn't provide this feature in the standard library, so you need the unicode-segmentation crate, which provides a graphemes method on str.
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
fn substr_match(pattern: &str, document: &str) -> bool {
let mut haystack = document.graphemes(true);
pattern.len() > 0 && pattern.graphemes(true).all(|needle| {
haystack
.find(|grapheme| {
grapheme
.chars()
.flat_map(char::to_lowercase)
.eq(needle.chars().flat_map(char::to_lowercase))
})
.is_some()
})
}
This algorithm takes advantage of several convenience methods on Iterator. all iterates over the pattern. find short-circuits, so whenever it finds the next needle in haystack, the next call to haystack.find will start at the following element.
(I thought this approach was somewhat clever, but honestly, a nested for loop is probably easier to read, so you might prefer that.)
The last "tricky" bit is case-insensitive string comparison, which is inherently language-dependent, but if you're willing to accept only unconditional mappings (those that apply in any language), char::to_lowercase does the trick. Rather than collect the result into a String, though, you can use Iterator::eq to compare the sequences of (lowercased) characters.
One other thing you may want to consider is Unicode normalization -- this question is a good place for the broad strokes. Fortunately, Rust has a unicode-normalization crate, too! And it looks quite easy to use. (You wouldn't necessarily want to use it in this function, though; instead, you might normalize all text on input so that you're dealing with the same normalization form everywhere in your program.)
str::chars() returns an iterator. Iterators return elements from a sequence one at a time. Specifically, str::chars() returns characters from a string one at a time. It's much more efficient to use a single iterator to iterate over a string than to create a new iterator each time you want to look up a character, because s.chars().nth(n) needs to perform a linear scan in order to find the nth character in the UTF-8 encoded string.
fn substr_match(pattern: &str, document: &str) -> bool {
let mut pattern_iter = pattern.chars();
let mut pattern_ch_lower: String = match pattern_iter.next() {
Some(ch) => ch,
None => return false,
}.to_lowercase().collect();
for document_ch in document.chars() {
let document_ch_lower: String = document_ch.to_lowercase().collect();
if pattern_ch_lower == document_ch_lower {
pattern_ch_lower = match pattern_iter.next() {
Some(ch) => ch,
None => return true,
}.to_lowercase().collect();
}
}
return false;
}
Here, I'm demonstrating two ways of using iterators:
To iterate over the pattern, I'm using the next method manually. next returns an Option: Some(value) if the iterator hasn't finished, or None if it has.
To iterate over the document, I'm using a for loop. The for loop does the work of calling next and unwrapping the result until next returns None.
One thing to notice is that I'm using a return expression inside a match expression (twice). Since a return expression doesn't produce a value, the compiler knows that its type doesn't matter. In this case, on the Some arm, the result is a char, so the whole match evaluates to a char.
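The same pattern in isolation, as a small sketch (the function first_lowercase is hypothetical and only there to show the typing of the match):

```rust
/// Returns the lowercased first char of `s`, or `None` for an empty string.
fn first_lowercase(s: &str) -> Option<String> {
    let ch: char = match s.chars().next() {
        Some(ch) => ch,
        // `return` never produces a value, so this arm is compatible
        // with the `char` produced by the other arm.
        None => return None,
    };
    // `to_lowercase` yields an iterator because one char may lowercase
    // to several chars; collecting gives a String.
    Some(ch.to_lowercase().collect())
}

fn main() {
    println!("{:?}", first_lowercase("Hello"));
    println!("{:?}", first_lowercase(""));
}
```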
We could also do this with two nested for loops:
fn substr_match(pattern: &str, document: &str) -> bool {
    if pattern.is_empty() {
        return false;
    }
    let mut document_iter = document.chars();
    'outer: for pattern_ch in pattern.chars() {
        let pattern_ch_lower: String = pattern_ch.to_lowercase().collect();
        for document_ch in &mut document_iter {
            let document_ch_lower: String = document_ch.to_lowercase().collect();
            if pattern_ch_lower == document_ch_lower {
                // Matched: move on to the next pattern character.
                continue 'outer;
            }
        }
        // The document ran out before this pattern character was found.
        return false;
    }
    true
}
There are two things to notice here:
We need to handle the case where the pattern is empty without using the iterator.
In the inner loop, we don't want to restart from the start of the document when we move to the next pattern character, so we need to reuse the same iterator over the document. When we write for x in iter, the for loop takes ownership of iter; to avoid that, we must write &mut iter instead. Mutable references to iterators are iterators themselves, thanks to the blanket implementation impl<'a, I> Iterator for &'a mut I where I: Iterator + ?Sized in the standard library.
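A tiny sketch of the &mut iter point on its own, with a made-up helper rest_after: the loop advances the iterator, and whatever it did not consume is still available afterwards.

```rust
/// Consumes chars up to and including the first `stop`,
/// then returns whatever is left in the iterator.
fn rest_after(s: &str, stop: char) -> String {
    let mut iter = s.chars();
    // Borrowing with `&mut` lets the loop advance `iter` without taking
    // ownership of it, so `iter` is still usable after the loop.
    for c in &mut iter {
        if c == stop {
            break;
        }
    }
    iter.collect()
}

fn main() {
    println!("{}", rest_after("abcdef", 'c'));
}
```

Writing for c in iter instead would move iter into the loop, and the later iter.collect() would not compile.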

Find next char boundary index in string after char

Given the string s, and the index i which is where the 好 character starts:
let s = "abc 好 def";
let i = 4;
What's the best way to get the index after that character, so that I can slice the string and get abc 好? In code:
let end = find_end(s, i);
assert_eq!("abc 好", &s[0..end]);
(Note, + 1 doesn't work because that assumes that the character is only 1 byte long.)
I currently have the following:
fn find_end(s: &str, i: usize) -> usize {
i + s[i..].chars().next().unwrap().len_utf8()
}
But I'm wondering if I'm missing something and there's a better way?
You could use char_indices to get the next index rather than using len_utf8 on the character, though that has a special case for the last character.
I would use the handy str::is_char_boundary() method. Here's an implementation using that:
fn find_end(s: &str, i: usize) -> usize {
assert!(i < s.len());
let mut end = i+1;
while !s.is_char_boundary(end) {
end += 1;
}
end
}
Normally I would make such a function return Option<usize> in case it's called with an index at the end of s, but for now I've just asserted.
In many cases, instead of explicitly calling find_end it may make sense to iterate using char_indices, which gives you each index along with the characters; though it's slightly annoying if you want to know the end of the current character.
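One way to get around that annoyance, sketched here with a hypothetical helper char_spans, is to add len_utf8() to each start index so that every char comes with both its start and its end offset:

```rust
/// Returns (start, end, char) for every char in `s`, where `start..end`
/// is the byte range the char occupies.
fn char_spans(s: &str) -> Vec<(usize, usize, char)> {
    s.char_indices()
        .map(|(start, c)| (start, start + c.len_utf8(), c))
        .collect()
}

fn main() {
    let s = "abc 好 def";
    for (start, end, c) in char_spans(s) {
        // `&s[start..end]` is always a valid slice containing just `c`.
        println!("{}..{} {:?} {:?}", start, end, c, &s[start..end]);
    }
}
```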
To complement @ChrisEmerson's answer, this is how one could implement a find_end that searches for the end of a character's first occurrence:
fn find_end<'s>(s: &'s str, p: char) -> Option<usize> {
let mut indices = s.char_indices();
let mut found = false;
for (_, v) in &mut indices {
if v == p {
found = true;
break;
}
}
if found {
Some(indices.next()
.map_or_else(|| s.len(), |(i, _)| i))
} else {
None
}
}
Although it avoids the byte boundary loop, it is still not very elegant. Ideally, an iterator method for traversing until a predicate is met would simplify this.
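One possible simplification, offered as a sketch and not as the answer's own method: Iterator::find already traverses until a predicate is met, and the end offset follows from the char's start plus its len_utf8().

```rust
/// Byte offset just past the first occurrence of `p` in `s`, if any.
fn find_end(s: &str, p: char) -> Option<usize> {
    s.char_indices()
        .find(|&(_, c)| c == p)
        .map(|(start, c)| start + c.len_utf8())
}

fn main() {
    let s = "abc 好 def";
    if let Some(end) = find_end(s, '好') {
        println!("{}", &s[..end]);
    }
}
```

Since the needle is a single char, s.find(p).map(|i| i + p.len_utf8()) should give the same result.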

How do I get the first character out of a string?

I want to get the first character of a std::str. The method char_at() is currently unstable, as is String::slice_chars.
I have come up with the following, but it seems excessive to get a single character and not use the rest of the vector:
let text = "hello world!";
let char_vec: Vec<char> = text.chars().collect();
let ch = char_vec[0];
UTF-8 does not define what "character" is so it depends on what you want. In this case, chars are Unicode scalar values, and so the first char of a &str is going to be between one and four bytes.
If you want just the first char, then don't collect into a Vec<char>, just use the iterator:
let text = "hello world!";
let ch = text.chars().next().unwrap();
Alternatively, you can use the iterator's nth method:
let ch = text.chars().nth(0).unwrap();
Bear in mind that elements preceding the index passed to nth will be consumed from the iterator.
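A quick sketch of what "consumed" means here (the helper nth_then_rest is made up for the demonstration): after nth(n), the iterator continues from the element after the one returned.

```rust
/// Calls `nth(n)` on the chars of `s` and returns the picked char
/// together with everything the iterator still has left afterwards.
fn nth_then_rest(s: &str, n: usize) -> (Option<char>, String) {
    let mut chars = s.chars();
    // `nth(n)` skips `n` elements and returns the next one, so
    // n + 1 elements are gone from the iterator after this call.
    let picked = chars.nth(n);
    (picked, chars.collect())
}

fn main() {
    println!("{:?}", nth_then_rest("hello", 1));
}
```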
I wrote a function that returns the head of a &str and the rest:
fn car_cdr(s: &str) -> (&str, &str) {
for i in 1..5 {
let r = s.get(0..i);
match r {
Some(x) => return (x, &s[i..]),
None => (),
}
}
(&s[0..0], s)
}
Use it like this:
let (first_char, remainder) = car_cdr("test");
println!("first char: {}\nremainder: {}", first_char, remainder);
The output looks like:
first char: t
remainder: est
It works fine with chars that are more than 1 byte.
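An alternative sketch of the same function that avoids probing slice lengths: take the first char from the iterator and split at its UTF-8 length. The behavior for the empty string (two empty parts) is kept the same as above.

```rust
/// Returns the first char of `s` as a `&str`, plus the remainder.
fn car_cdr(s: &str) -> (&str, &str) {
    match s.chars().next() {
        // `len_utf8` is the number of bytes the first char occupies,
        // so splitting there always lands on a char boundary.
        Some(c) => s.split_at(c.len_utf8()),
        None => ("", s),
    }
}

fn main() {
    let (first_char, remainder) = car_cdr("好test");
    println!("first char: {}\nremainder: {}", first_char, remainder);
}
```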
Get the first single character out of a string w/o using the rest of that string:
let text = "hello world!";
let ch = text.chars().take(1).last().unwrap();
It would be nice to have something similar to Haskell's head function and tail function for such cases.
I wrote this function to act like head and tail together (doesn't match exact implementation)
pub fn head_tail<T: Iterator, O: FromIterator<<T>::Item>>(iter: &mut T) -> (Option<<T>::Item>, O) {
(iter.next(), iter.collect::<O>())
}
Usage:
// works with Vec<i32>
let mut val = vec![1, 2, 3].into_iter();
println!("{:?}", head_tail::<_, Vec<i32>>(&mut val));
// works with chars in two ways
let mut val = "thanks! bedroom builds YT".chars();
println!("{:?}", head_tail::<_, String>(&mut val));
// calling the function with Vec<char>
let mut val = "thanks! bedroom builds YT".chars();
println!("{:?}", head_tail::<_, Vec<char>>(&mut val));
NOTE: The head_tail function doesn't panic! if the iterator is empty. If this matched Haskell's head/tail output, this would have thrown an exception if the iterator was empty. It might also be good to use iterable trait to be more compatible to other types.
If you only want to test for it, you can use starts_with():
"rust".starts_with('r')
"rust".starts_with(|c| c == 'r')
I think it is pretty straightforward:
let text = "hello world!";
let c: char = text.chars().next().unwrap();
next() takes the next item from the iterator
To “unwrap” something in Rust is to say, “Give me the result of the computation, and if there was an error, panic and stop the program.”
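Note that next() returns None for an empty string, so the unwrap() above would panic there. If that can happen in your program, a sketch like the following avoids the panic; the function name first_char_or and the fallback value are my own choices.

```rust
/// First char of `text`, or `fallback` if `text` is empty.
fn first_char_or(text: &str, fallback: char) -> char {
    text.chars().next().unwrap_or(fallback)
}

fn main() {
    // Handle the Option explicitly...
    if let Some(c) = "hello world!".chars().next() {
        println!("first char: {}", c);
    }
    // ...or supply a default for the empty case.
    println!("{}", first_char_or("", '?'));
}
```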
The accepted answer is a bit ugly!
let text = "hello world!";
let ch = &text[0..1]; // this returns "h" as a &str, not a char; it panics if the first character is longer than one byte

Resources