How to get a substring of a &str based on character index?

I am trying to write a program that takes a list of words and then, if the word has an even length, prints the two middle letters. If the word has an odd length, it prints the single middle letter.
I can find the index of the middle letter(s), but I do not know how to use that index to print the corresponding letters of the word.
fn middle(wds: &[&str]) {
    for word in wds {
        let index = word.chars().count() / 2;
        match word.chars().count() % 2 {
            0 => println!("Even word found"),
            _ => println!("odd word found"),
        }
    }
}

fn main() {
    let wordlist = ["Some", "Words", "to", "test", "testing", "elephant", "absolute"];
    middle(&wordlist);
}

You can use slices for this, specifically &str slices (note the &). Slice ranges are byte offsets, so the example below is only reliable for ASCII words.
These links might be helpful:
https://riptutorial.com/rust/example/4146/string-slicing
https://doc.rust-lang.org/book/ch04-03-slices.html
fn main() {
    let s = "elephant";
    let mid = s.len() / 2;
    let sliced = &s[mid - 1..mid + 1];
    println!("{}", sliced);
}

After posting I found two different ways of doing it. Having two separate approaches in my head was confusing me and keeping me from finding the exact answer.
// I fixed printing the middle letter of an odd-length word with
word.chars().nth(index).unwrap()
// and printed the two middle letters of an even-length word with
&word[index - 1..index + 1]
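The two snippets above mix a character index (chars().nth) with a byte range (&word[..]), which only agree for ASCII words. Here is a minimal sketch that works on character indices throughout. The helper name middle_letters is made up for illustration and is not from the original post; it assumes "letter" means one Rust char.

```rust
/// Returns the middle character (odd length) or the two middle characters
/// (even length) of `word`, counting `char`s rather than bytes.
/// `middle_letters` is a hypothetical helper name, not from the original post.
fn middle_letters(word: &str) -> &str {
    let count = word.chars().count();
    if count == 0 {
        return "";
    }
    // Character indices of the first middle char and one past the last.
    let (first, last) = if count % 2 == 0 {
        (count / 2 - 1, count / 2 + 1)
    } else {
        (count / 2, count / 2 + 1)
    };
    // Translate a character index into a byte offset; an index equal to the
    // character count maps to the end of the string.
    let byte_at = |idx: usize| {
        word.char_indices()
            .map(|(pos, _)| pos)
            .nth(idx)
            .unwrap_or(word.len())
    };
    &word[byte_at(first)..byte_at(last)]
}

fn main() {
    for word in ["Some", "Words", "to", "test", "testing", "elephant", "absolute"] {
        println!("{word}: {}", middle_letters(word));
    }
}
```

Because the result is a slice of the input, no new String is allocated; the cost is a linear walk over the word to find the byte offsets.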

Related

Taking only an int from a text with int string in Rust

I need to take only the integer from a string like "Critical: 3\r\n". The value changes every time, so I can't search for "3"; I need to search for a generic integer.
Thanks.

Many ways to do it. There are already some answers. Here is one more approach:
let s = "Critical: 3\r\n";
let s_res = s.split(":").collect::<Vec<&str>>()[1].trim();
println!("s_res = {s_res:?}"); // "3"
In the above code s_res will be a string (&str). To convert that string to an integer, you can do something like this:
let n: isize = s_res.parse().expect("Failed to parse the integer!");
println!("n = {n}"); // 3
Note that, depending on your needs, you might want to add extra validation in case the pattern can change (for example, if the number of colons is not exactly 1).
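As one possible form of that validation, here is a sketch that returns an Option instead of panicking when the colon or the number is missing. The function name value_after_colon is hypothetical; it assumes the number always follows the first colon.

```rust
/// Hypothetical helper: extracts the integer after the first ':' in a line
/// such as "Critical: 3\r\n". Returns `None` if there is no colon or the
/// remainder is not a valid integer.
fn value_after_colon(line: &str) -> Option<isize> {
    // `split_once` splits at the first ':' only, so no Vec is needed.
    let (_label, value) = line.split_once(':')?;
    // `trim` removes the leading space and the trailing "\r\n".
    value.trim().parse().ok()
}

fn main() {
    println!("{:?}", value_after_colon("Critical: 3\r\n"));
    println!("{:?}", value_after_colon("no colon here"));
}
```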

Building on @AlexanderKrauze's comment, the most common way to do this is with a regex, which lets you look for any pattern in a string:
use regex::{Match, Regex}; // from the external regex crate

let your_text = "Critical: 3\r\n";
let re = Regex::new(r"\d+").unwrap(); // matches one or more consecutive digits
let result: Option<Match> = re.find(your_text); // returns the first match, if any
let number: u32 = result.map(|m| m.as_str().parse::<u32>().unwrap()).unwrap_or(0); // converts to an integer, 0 if nothing matched
print!("{}", number);
To match exactly one digit, use r"\d" instead.
More documentation is found here.

You can use chars to get an iterator over the chars of a string, and then apply filter on that iterator to keep only the digits (is_digit).
fn main() {
    let my_str: String = "Critical: 3\r\n".to_owned();
    let digits: String = my_str.chars().filter(|c| c.is_digit(10)).collect();
    println!("{}", digits)
}
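One caveat with filtering: every digit in the text is concatenated, so an input like "1 of 3" yields "13". If only the first number is wanted, a std-only sketch along the following lines may help. The name first_number is hypothetical, and it assumes the value fits in a u32.

```rust
/// Hypothetical helper: returns the first run of consecutive ASCII digits in
/// `text` as a number, or `None` if there are no digits or the run does not
/// fit in a `u32`.
fn first_number(text: &str) -> Option<u32> {
    let digits: String = text
        .chars()
        .skip_while(|c| !c.is_ascii_digit()) // drop everything before the first digit
        .take_while(|c| c.is_ascii_digit()) // keep digits until the run ends
        .collect();
    // An empty string fails to parse, which becomes `None`.
    digits.parse().ok()
}

fn main() {
    println!("{:?}", first_number("Critical: 3\r\n"));
    println!("{:?}", first_number("1 of 3"));
}
```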

Why is the string being converted to bytes in this Rust tutorial?

I am reading the Rust tutorial, and in this section it converts a string into a byte array like so:
fn first_word(s: &String) -> usize {
    let bytes = s.as_bytes();
    for (i, &item) in bytes.iter().enumerate() {
        if item == b' ' {
            return i;
        }
    }
    s.len()
}
They state that this conversion is because we want to find the first instance of the space character so we need to compare to it. My question is why do we need to convert to bytes? What if instead of converting the string to bytes we convert the byte ' ' into a String and compare to that?

Strings in Rust are UTF-8 encoded. You can iterate over chars but that will be a bit slower because Unicode code points are variable length, and the char type is 4 bytes long so you can't fit as many in a cache line.
A space has the same byte representation regardless of whether you are using ASCII or UTF-8 encoding, so this is an easy optimisation. It's also the same amount of code as iterating over chars.
But, probably more importantly, the function in question is returning an index for where the character is found. Finding the index by iterating over chars would tell you how many Unicode code points to skip to get to that position, but you'd have to iterate again each time you wanted to use the index, because each preceding code point could be anywhere from 1 to 4 bytes long. An index into bytes is much more straightforward and efficient.
For example, with a byte index:
use std::str;

let words = String::from("Hello there");
let index = first_word(&words); // byte index
// just a slice
let first_word = str::from_utf8(&words.as_bytes()[0..index]).unwrap();
Indexing Unicode code points:
let words = String::from("Hello there");
let index = first_word(&words); // code point index
// having to iterate again, and allocate a new String
let first_word: String = words.chars().take(index).collect();
Any method to take a slice here would involve calculating the byte position first.
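To make the difference concrete, here is a small sketch on a non-ASCII input, where the two kinds of index no longer coincide. The function names are made up for this illustration; the input string is an arbitrary example.

```rust
/// Byte offset of the first space, or the byte length if there is none.
fn first_space_byte_index(s: &str) -> usize {
    s.bytes().position(|b| b == b' ').unwrap_or(s.len())
}

/// Number of code points before the first space, or the total count if none.
fn first_space_char_index(s: &str) -> usize {
    s.chars().position(|c| c == ' ').unwrap_or(s.chars().count())
}

fn main() {
    let words = "привет there";
    let byte_index = first_space_byte_index(words);
    let char_index = first_space_char_index(words);

    // The byte index can be used for slicing directly, with no second pass.
    let by_bytes = &words[..byte_index];
    // The code point index requires walking the string again (and here, allocating).
    let by_chars: String = words.chars().take(char_index).collect();

    println!("byte index {byte_index}, char index {char_index}");
    println!("{by_bytes} / {by_chars}");
}
```

Searching the bytes for b' ' is safe in UTF-8 because every byte of a multi-byte character has its high bit set, so it can never be mistaken for an ASCII space.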

Slice a string containing Unicode chars

I have a piece of text with characters of different byte lengths.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:
slice = text[start:end] ?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?

Possible solutions to codepoint slicing
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
Because str is simply a UTF-8 buffer, indexing by Unicode code points is an O(n) operation. Usually, people expect the [] operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of Unicode. Python slices by Unicode code points, which is what a Rust char represents. A char is 32 bits wide (a few fewer bits would suffice, but the size is rounded up to a power of 2).
What you actually want is to slice user-perceived characters, but that is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more Unicode code points. Consider this Python 3 code:
>>> s = "Jürgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'
Also not what you'd expect. This time, fi is actually the ligature ﬁ, which is one codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
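The combining-character pitfall is not specific to Python. A small sketch using only the Rust standard library shows the same behaviour when slicing by chars(); the helper name first_code_points is made up for this example.

```rust
/// Takes the first `n` code points of `s`, similar to Python's `s[0:n]`.
/// Hypothetical helper for illustration only.
fn first_code_points(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}

fn main() {
    // "Jürgen" written with a combining diaeresis: J, u, U+0308, r, g, e, n.
    let decomposed = "Ju\u{0308}rgen";
    // The same name written with the precomposed letter U+00FC.
    let precomposed = "J\u{00FC}rgen";

    // Both render the same, but they contain a different number of chars.
    println!("{} vs {}", decomposed.chars().count(), precomposed.chars().count());
    // Cutting after two code points splits the "ü" in the decomposed form.
    println!("{}", first_code_points(decomposed, 2));
    println!("{}", first_code_points(precomposed, 2));
}
```

The two strings look identical on screen yet give different results, which is why grapheme-aware iteration is the safer choice for user-facing text.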
Further resources on this topic:
Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions"
http://utf8everywhere.org/

A UTF-8 encoded string may contain characters that consist of multiple bytes. In your case, п starts at byte index 6 (inclusive) and ends at byte index 8 (exclusive), so byte index 7 is not the start of a character. This is why your error occurred.
You may use str::char_indices() to solve this (remember that getting to a position in a UTF-8 string is O(n)):
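fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).and_then(|(start_pos, _)| {
        let rest = &string[start_pos..];
        // Byte offsets of each char in `rest`, plus its total length, so that
        // an `end` equal to the character count is still valid.
        rest.char_indices()
            .map(|(pos, _)| pos)
            .chain(Some(rest.len()))
            .nth(end - start)
            .map(|len| &rest[..len])
    })
}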
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();

Here is a function which retrieves a UTF-8 slice, with the following properties:
it handles all edge cases (empty input, zero-width output ranges, out-of-range indices);
it never panics;
it uses start-inclusive, end-exclusive ranges.
pub fn utf8_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    let mut iter = s
        .char_indices()
        .map(|(pos, _)| pos)
        .chain(Some(s.len()))
        .skip(start)
        .peekable();
    let start_pos = *iter.peek()?;
    for _ in start..end {
        iter.next();
    }
    Some(&s[start_pos..*iter.peek()?])
}

Checking a byte inside of a loop doesn't work

I'm building a translator that should convert words into Pig Latin (e.g. the word apple becomes apple-hay and the word happy becomes appy-fay). If the word begins with a vowel, the translator keeps the whole word and adds "-hay" to its end; if the word begins with a consonant, it drops the first letter and adds "-fay" to the end:
use std::str;

fn main() {
    // The case when it works perfectly well
    let dict = String::from("Hello").into_bytes();
    let vowels: Vec<u8> = vec![b'a', b'e', b'i', b'o', b'u'];
    let mut result = String::new();
    for c in vowels.iter() {
        if &dict[0] == c {
            result = str::from_utf8(&dict).unwrap().to_owned() + "-hay ";
        } else {
            result = str::from_utf8(&dict[1..]).unwrap().to_owned() + "-fay";
        }
    }
    println!("{}", result);
}
The code compiles without any errors or warnings, and if I pass a string that begins with a consonant it works perfectly well. However, when I pass a string that starts with a vowel, e.g. apple, the program behaves as if the word began with a consonant and still executes the else block. What is my error here?

You need to break once you find a matching vowel. Otherwise, unless the first letter of the string happens to be the last vowel in your set, once you match it, you'll:
continue to compare the first letter of the string against every other vowel,
find that it doesn't match, because a letter can't be two or more vowels at once, and hence
conclude (potentially multiple times, and finally) that the first letter of the string is a consonant.
Anyway, that should be a separate function, not main(), and just return once you find a match, so you won't need a result variable or break.
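A minimal sketch of that suggestion could look like the following. The function names starts_with_vowel and pig_latin are made up here, the suffix rules are the ones stated in the question, and treating uppercase vowels as vowels is an added assumption.

```rust
/// Returns true if `word` starts with an ASCII vowel.
/// Case-insensitive matching is an assumption, not part of the question.
fn starts_with_vowel(word: &str) -> bool {
    const VOWELS: [u8; 5] = [b'a', b'e', b'i', b'o', b'u'];
    let first = match word.bytes().next() {
        Some(b) => b.to_ascii_lowercase(),
        None => return false, // empty input has no first letter
    };
    for vowel in VOWELS {
        if first == vowel {
            return true; // stop at the first match, no later iteration can undo it
        }
    }
    false
}

/// Applies the rules from the question: a leading vowel keeps the word and
/// appends "-hay"; otherwise the first letter is dropped and "-fay" appended.
fn pig_latin(word: &str) -> String {
    if starts_with_vowel(word) {
        return format!("{word}-hay");
    }
    let mut chars = word.chars();
    chars.next(); // drop the first character, whatever its byte length
    format!("{}-fay", chars.as_str())
}

fn main() {
    for word in ["apple", "happy", "Hello"] {
        println!("{}", pig_latin(word));
    }
}
```

Returning from the helper as soon as a vowel matches removes the need for both the mutable result variable and an explicit break.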

Accessing a character in a borrowed string by index

How would you access an element in a borrowed string by index?
Straightforward in Python:
my_string_list = list(my_string)
print my_string_list[0]
print my_string[0] # same as above
Rust (attempt 1):
let my_string_vec = vec![my_string]; // doesn't work
println!("{}", my_string_vec[0]); // returns entire of `my_string`
Rust (attempt 2):
let my_string_vec = my_string.as_bytes(); // returns a &[u8]
println!("{}", my_string_vec[0]); // prints nothing
My end goal is to stick it into a loop like this:
for pos in 0..my_string_vec.len() {
    while shift <= pos && my_string_vec[pos] != my_string_vec[pos-shift] {
        shift += shifts[pos-shift];
    }
    shifts[pos+1] = shift;
}
for ch in my_string_vec {
    let pos = 0; // simulate some runtime index
    if my_other_string_vec[pos] != ch {
        ...
    }
}
I think it's possible to use my_string_vec.as_bytes()[pos] and my_string_vec.as_bytes()[pos-shift] in my condition, but I feel that this is a code smell.

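You can use my_string.chars().nth(index) to access a specific character (the older char_at(index) method was never stabilized and has since been removed from the standard library). If you want to iterate over the characters in a string, you can use the chars() method, which yields an iterator over the characters in the string.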
The reason it was specifically not made possible to use indexing syntax is, IIRC, because indexing syntax would give the impression that it was like accessing a character in your typical C-string-like string, where accessing a character at a given index is a constant time operation (i.e. just accessing a single byte in an array). Strings in Rust, on the other hand, are Unicode and a single character may not necessarily consist of just one byte, making a specific character access a linear time operation, so it was decided to make that performance difference explicit and clear.
As far as I know, there is no method available for swapping characters in a string (see this question). Note that this wouldn't have been possible anyways via an immutably borrowed string, since such a string isn't yours to modify. You would have to most likely use a String, or perhaps a &mut str if you're strictly swapping, but I'm not too familiar with Unicode's intricacies.
I recommend instead you build up a String the way you want it, that way you don't have to worry about the mutability of the borrowed string. You'd refer/look into the borrowed string, and write into the output/build-up string accordingly based on your logic.
So this:
for pos in 0..my_string_vec.len() {
    while shift <= pos && my_string_vec[pos] != my_string_vec[pos-shift] {
        shift += shifts[pos-shift];
    }
    shifts[pos+1] = shift;
}
Might become something like this (not tested; not clear what your logic is for):
for (pos, ch) in my_string.chars().enumerate() {
    while shift <= pos && ch != my_string.chars().nth(pos - shift).unwrap() {
        // assuming shifts is a vec; not clear in question
        shift += shifts[pos - shift];
    }
    shifts.push(shift);
}
Your last for loop:
for ch in my_string_vec {
    let pos = 0; // simulate some runtime index
    if my_other_string_vec[pos] != ch {
        ...
    }
}
That kind of seems like you want to compare a given character in string A with the corresponding character (in the same position) of string B. For this I would recommend zipping the chars iterator of the first with the second, something like:
for (left, right) in my_string.chars().zip(my_other_string.chars()) {
    if left != right {
    }
}
Note that zip() stops iterating as soon as either iterator stops, meaning that if the strings are not the same length, then it'll only go as far as the shortest string.
If you need access to the "character index" information, you could add .enumerate() to that, so the above would change to:
for (index, (left, right)) in my_string.chars().zip(my_other_string.chars()).enumerate() {
    ...
}
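As a complete example of the zip plus enumerate pattern, here is a sketch that reports the first position where two strings differ. The function name first_mismatch is hypothetical and the strings are arbitrary test values.

```rust
/// Hypothetical helper: character index of the first position where `a` and
/// `b` differ. Only the common prefix length is compared, because `zip`
/// stops as soon as the shorter string runs out.
fn first_mismatch(a: &str, b: &str) -> Option<usize> {
    for (index, (left, right)) in a.chars().zip(b.chars()).enumerate() {
        if left != right {
            return Some(index);
        }
    }
    None
}

fn main() {
    println!("{:?}", first_mismatch("привет", "примет"));
    println!("{:?}", first_mismatch("abc", "abcd"));
}
```

Note that the returned index counts characters, not bytes, so it cannot be used directly in a slice range such as &a[..index] on non-ASCII input.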
