Split string and skip empty substrings - string

I am learning Rust and discovered this problem:
I would like to split a string by a pattern and remove all cases where the resulting substring is empty.
Here is an example:
let s = "s,o,m,e,";
for elem in s.split(",").skip_while(|&x| x.is_empty()) {
print!(" <{}> ", elem);
//print!(" <{}>({}) ", elem, elem.is_empty());
}
But the result is the following:
<s> <o> <m> <e> <>
My thoughts were: The struct Split returned by split implements Iterator which provides skip_while. IntelliSense told me the x in the closure is of type &&str so I would assume all the elements of the iterator (of type &str) which are empty to be omitted.
But it doesn't skip the empty substring.
I also tried printing the result of the is_empty function. It shows that the last slice is indeed empty. If I instead call skip_while(|&x| x == "s"), it correctly leaves out the "s" (printed with is_empty here):
<o>(false) <m>(false) <e>(false) <>(true)
So somehow the slice behaves differently in the iterator?
Why is that or where am I mistaken?

If you only need to omit the 1 empty string at the end of input, just use split_terminator instead of split. This adapter is basically just like split but it treats the pattern argument as a terminator instead of a separator, so the empty string at the end of input is not considered a new element.
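As a small sketch of that difference (the helper name split_terminated is made up for illustration), the function below collects the pieces so they can be inspected. Only the single trailing empty piece is dropped; empty pieces in the middle are kept.

```rust
/// Splits on commas, treating the comma as a terminator rather than a
/// separator, so one trailing comma does not produce a final empty element.
/// (Hypothetical helper, not part of the question's code.)
fn split_terminated(s: &str) -> Vec<&str> {
    s.split_terminator(',').collect()
}
```

Calling split_terminated("s,o,m,e,") yields the four letters without the trailing empty slice, whereas plain split(',') on the same input yields a fifth, empty element.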
If you truly want to skip all the empty strings, keep reading.
skip_while is doing exactly as its documentation says (emphasis mine):
skip_while() takes a closure as an argument. It will call this closure on each element of the iterator, and ignore elements until it returns false.
After false is returned, skip_while()'s job is over, and the rest of the elements are yielded.
Filtering out all the elements that match a predicate, regardless of where they are in the sequence, is the job of filter:
let s = ",s,o,,m,e,";
for elem in s.split(",").filter(|&x| !x.is_empty()) {
print!(" <{}> ", elem);
}
The above code will have the effect you wanted.
Note that the predicate to filter has the opposite meaning to skip_while: instead of returning true for elements that should not be yielded, it returns true for elements that should be yielded.
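To make the contrast concrete, here is a minimal side-by-side sketch. The two function names are invented for this example; both use the same emptiness check, once as "skip while true" and once negated as "keep if true".

```rust
/// Drops only the leading run of empty pieces; later empty pieces survive.
fn with_skip_while(s: &str) -> Vec<&str> {
    s.split(',').skip_while(|x| x.is_empty()).collect()
}

/// Drops every empty piece, wherever it occurs.
fn with_filter(s: &str) -> Vec<&str> {
    s.split(',').filter(|x| !x.is_empty()).collect()
}
```

For the input ",,s,o,,m,e," the first function still returns the empty pieces after "o" and after "e", while the second returns only the four letters.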

skip_while ignores elements until the closure returns false; that is, in your case it yields every element from the first non-empty string onward. For instance, given:
let s = ",,s,o,m,e,";
skip_while will ignore the first two empty strings but keep the last one:
let s = ",,s,o,m,e,";
for elem in s.split(",").skip_while(|&x| x.is_empty()) {
print!(" <{}> ", elem);
}
Will print:
<s> <o> <m> <e> <>
For your case, it seems you just need filter:
let s = "s,o,m,e,";
for elem in s.split(",").filter(|&x| !x.is_empty()) {
print!(" <{}> ", elem);
}

Related

Rust reconstitute format=flowed emails, or an iterator that combines some elements of the previous iterator

Currently I have a program that is reading some emails from disk, and parsing some included text (that is csv-like, although it happens to use fixed-width fields and is '|' separated).
The emails are not particularly huge, so I fs::read_to_string them into a string (in a loop), and for each one use .split("\n") to iterate over lines, then run a constructor on each line to create a struct for each valid csv-like line.
So like
let mut hostiter = text.split("\n")
.filter_map(|x| HostInfo::from_str(x));
Where HostInfo has owned values, copying from the &str references.
This all works fine as is, but now I want to be able to handle emails that quote the records I'm looking for (i.e. lines that start with "> > "). That's easy enough:
let quotes = &['>', ' '];
let mut hostiter = text.split("\n")
.map(|x| x.trim_start_matches(quotes))
.filter_map(|x| HostInfo::from_str(x));
I also need to cope with rfc3676/format=flowed emails. This means that, when forwarded/replied to, email clients split the lines so that each record I'm looking for is split over 2 or more lines. Continuation lines are delineated with " \r\n", i.e. they have a space before the cr/newline. Non-continuation lines have the "\r\n" after a non-space character. (Currently my code skips these partial records.) I need an iterator that iterates over complete lines. I'm thinking of two ways of doing this:
1. The easiest may be to split the string (on '\n'), trim the starts of any quoting, then collect the pieces into a new string with '\n' separating, to remove the quotes. Then a second pass replaces all " \r\n" with ' ', again producing a new string. Now I have a string that can be split on '\n' and has complete records.
2. Else, is there an iterator adapter I can use that will combine elements if they are continuation lines? e.g. can I use group_by to group lines with their continuation lines?
I realize I can't have an iterator that returns complete records as a single &str (unless I do 1.), since the records are split in the original string. However I can refactor my constructor to take a vector of &str instead of a single &str.
In the end I used coalesce to group the lines. Since the items I'm iterating over are &str which can't be joined without allocation I decided to store the output as Vec<&str>. Since coalesce wants the same types as input and output (why?), I needed to convert the &str to single item vectors before using it. The resulting code was:
let mut hostiter = text.split("\r\n")
.map(|x| vec![x.trim_start_matches(quotes)])
.coalesce(|mut x, mut y| match o.flowed && x[x.len()-1].ends_with(' ') {
true => { x.append(&mut y); Ok( x )},
false => Err( (x,y) ),
})
.filter_map(|x| HostInfo::from_vec_str(x));
(o.flowed is a flag indicating whether we picked up a Content-Type: with format=flowed in the headers of the email.)
I had to convert my HostInfo::from_str function to HostInfo::from_vec_str to take a Vec<&str> instead of a &str. Since my from_str function splits the &str on spaces anyway, it was easy enough to use flat_map to split each &str in the Vec and output words...
Not sure if coalesce is the best way to do this. I was looking for an iterator adaptor that would take a closure that takes a collection and an item, and returns a bool; I.e. does this item belong with the other items in this collection? The iterator adaptor output would iterate over collections of items.
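For comparison, the same grouping can be sketched with the standard library alone, without itertools. This is not the code from the post: the function name group_flowed is invented, the o.flowed flag is assumed to be always on, and the result is collected eagerly into a Vec rather than produced lazily.

```rust
/// Groups physical lines into logical records. A line that ends with a
/// space is treated as "flowed", so the following line joins its group.
/// Hypothetical helper; assumes the whole text is format=flowed.
fn group_flowed<'a>(text: &'a str, quotes: &[char]) -> Vec<Vec<&'a str>> {
    let mut groups: Vec<Vec<&'a str>> = Vec::new();
    // True when the previous line ended with a space (a continuation follows).
    let mut continues = false;
    for raw in text.split("\r\n") {
        // Strip any leading quote markers such as "> > ".
        let line = raw.trim_start_matches(quotes);
        if continues {
            // `continues` is only set after a group exists, so this is Some.
            if let Some(last) = groups.last_mut() {
                last.push(line);
            }
        } else {
            groups.push(vec![line]);
        }
        continues = line.ends_with(' ');
    }
    groups
}
```

Each inner Vec<&str> borrows from the original text, so nothing is copied; a constructor such as the post's from_vec_str could then be mapped over the outer Vec.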

Need to extract the last word in a Rust string

I am doing some processing of a string in Rust, and I need to be able to extract the last set of characters from that string. In other words, given a string like the following:
some|not|necessarily|long|name
I need to be able to get the last part of that string, namely "name" and put it into another String or a &str, in a manner like:
let last = call_some_function("some|not|necessarily|long|name");
so that last becomes equal to "name".
Is there a way to do this? Is there a string function that will allow this to be done easily? If not (after looking at the documentation, I doubt that there is), how would one do this in Rust?
While the answer from @effect is correct, it is neither the most idiomatic nor the most performant way to do it. It'll walk the entire string and match all of the |s to reach the last. You can make it better, but there is a method of str that does exactly what you want, rsplit_once():
let (_, name) = s.rsplit_once('|').unwrap();
// Or
// let name = s.rsplit_once('|').unwrap().1;
//
// You can also use a multichar separator:
// let (_, name) = s.rsplit_once("|").unwrap();
// But in the case of a single character, a `char` type is likely to be more performant.
You can use the split() method (defined on str, and available on String through deref), which returns an iterator over the substrings split by that separator, and then use the Iterator::last() method to return the last element of the iterator, like so:
let s = String::from("some|not|necessarily|long|name");
let last = s.split('|').last().unwrap();
assert_eq!(last, "name");
Please also note that string slices (&str) provide the split method directly, so you don't need a String at all:
let s = "some|not|necessarily|long|name";
let last = s.split('|').last().unwrap();
assert_eq!(last, "name");
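If the input might not contain a '|' at all, unwrapping rsplit_once would panic. One possible alternative, sketched here with an invented function name, is to take the first item of rsplit, which searches from the end and falls back to the whole string when no separator is present.

```rust
/// Returns the text after the last '|', or the whole string if there is none.
/// Hypothetical helper for illustration.
fn last_segment(s: &str) -> &str {
    // rsplit yields pieces from the end, so its first item is the final
    // segment. It always yields at least one item; unwrap_or is a safety net.
    s.rsplit('|').next().unwrap_or(s)
}
```

Like rsplit_once, this only scans backwards to the last separator instead of walking the entire string.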

Checking a byte inside of a loop doesn't work

I'm building a translator that should convert words into Pig Latin (i.e. the word apple to apple-hay or the word happy to appy-fay). If the word begins with a vowel, it keeps the first letter and adds "-hay" to the end; if it begins with a consonant, it drops the first letter and adds "-fay" to the end:
use std::str;
fn main() {
// The case when it works perfectly well
let dict = String::from("Hello").into_bytes();
let vowels: Vec<u8> = vec![b'a', b'e', b'i', b'o', b'u'];
let mut result = String::new();
for c in vowels.iter() {
if &dict[0] == c {
result = str::from_utf8(&dict).unwrap().to_owned() + "-hay ";
} else {
result = str::from_utf8(&dict[1..]).unwrap().to_owned() + "-fay";
}
}
println!("{}", result);
}
The code compiles without any errors or warnings, and if I pass a string that begins with a consonant it works perfectly well. However, when I pass a string that starts with a vowel, e.g. apple, the function behaves as if the word began with a consonant and still performs the actions from the else block. What is my error here?
You need to break once you find a matching vowel... Otherwise, unless the first letter of the string happens to be the last vowel in your set, once you match it, you'll continue to compare the first letter of the string against every other vowel, find that it doesn't match (because obviously a letter can't be two or more vowels at once), and hence conclude, potentially multiple times and finally, that the first letter of the string is a consonant.
Anyway, that should be a separate function, not main(), and just return once you find a match, so you won't need a result variable or break.
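A minimal sketch of that suggestion follows. The function names are made up, and the rules are the ones stated in the question (keep the word and add "-hay" for a vowel, drop the first letter and add "-fay" otherwise). Lowercasing the first letter before the check is an extra assumption so that "Apple" counts as starting with a vowel.

```rust
/// Returns true as soon as the first letter matches a vowel.
/// Hypothetical helper; the early `return` replaces the `break`.
fn starts_with_vowel(word: &str) -> bool {
    let first = match word.chars().next() {
        Some(c) => c.to_ascii_lowercase(),
        None => return false, // an empty word has no first letter
    };
    for vowel in "aeiou".chars() {
        if first == vowel {
            return true; // stop at the first match, no further comparisons
        }
    }
    false
}

/// Applies the question's rules to a single word.
fn pig_latin(word: &str) -> String {
    if starts_with_vowel(word) {
        return format!("{}-hay", word);
    }
    let mut chars = word.chars();
    match chars.next() {
        // as_str() is what remains after the first character was consumed.
        Some(_) => format!("{}-fay", chars.as_str()),
        None => String::new(),
    }
}
```

Because the decision is made once, after the loop has either returned true or run out of vowels, a later non-matching vowel can no longer overwrite an earlier match.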

How to check if two strings can be made equal by using recursion?

I am trying to practice recursion, but at the moment I don't quite understand it well...
I want to write a recursive Boolean function which takes 2 strings as arguments, and returns true if the second string can be made equal to the first by replacing some letters with a certain special character.
I'll demonstrate what I mean:
Let s1 = "hello", s2 = "h%lo", where '%' is the special character.
The function will return true since '%' can replace "el", causing the two strings to be equal.
Another example:
Let s1 = "hello", s2 = "h%l".
The function will return false since an 'o' is lacking in the second string, and there is no special character that can replace the 'o' (h%l% would return true).
Now the problem isn't so much with writing the code, but with understanding how to solve the problem in general, I don't even know where to begin.
If someone could guide me in the right direction I would be very grateful, even by just using English words, I'll try to translate it to code (Java)...
Thank you.
So this is relatively easy to do in Python. The method I chose was to put the first string ("hello") into an array then iterate over the second string ("h%lo") comparing the elements to those in the array. If the element was in the array i.e. 'h', 'l', 'o' then I would pop it from the array. The resulting array is then ['e','l']. The special character can be found as it is the element which does not exist in the initial array.
One can then substitute the special character for the joined array "el" in the string and compare with the first string.
In the first case this will give "hello" == "hello" -> True
In the second case this will give "hello" == "helol" -> False
I hope this helps and makes sense.
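Since the question asks for recursion specifically, here is one possible recursive formulation, written in Rust to match the rest of this page; the structure should carry over to Java fairly directly. The function name can_match is invented, and the sketch assumes that '%' must stand for at least one character, which the question does not state explicitly.

```rust
/// Returns true if s2 can be turned into s1 by replacing each '%' in s2
/// with one or more characters. Assumption: '%' never matches nothing.
fn can_match(s1: &str, s2: &str) -> bool {
    let mut pattern = s2.chars();
    match pattern.next() {
        // Pattern used up: success only if the text is used up as well.
        None => s1.is_empty(),
        Some('%') => {
            let mut text = s1.chars();
            // '%' has to absorb at least one character of the text.
            if text.next().is_none() {
                return false;
            }
            let rest_text = text.as_str();
            // Either '%' keeps absorbing (same pattern, shorter text),
            // or it is finished and the rest of the pattern takes over.
            can_match(rest_text, s2) || can_match(rest_text, pattern.as_str())
        }
        Some(c) => {
            let mut text = s1.chars();
            // An ordinary character must equal the next character of the text.
            text.next() == Some(c) && can_match(text.as_str(), pattern.as_str())
        }
    }
}
```

Every recursive call shortens at least one of the two strings, so the recursion terminates. It can take exponential time on patterns with many '%' characters, which is acceptable for practice but not for large inputs.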

Accessing a character in a borrowed string by index

How would you access an element in a borrowed string by index?
Straightforward in Python:
my_string_lst = list(my_string)
print my_string_lst[0]
print my_string[0] # same as above
Rust (attempt 1):
let my_string_vec = vec![my_string]; // doesn't work
println!("{}", my_string_vec[0]); // returns entire of `my_string`
Rust (attempt 2):
let my_string_vec = my_string.as_bytes(); // returns a &[u8]
println!("{}", my_string_vec[0]); // prints nothing
My end goal is to stick it into a loop like this:
for pos in 0..my_string_vec.len() {
while shift <= pos && my_string_vec[pos] != my_string_vec[pos-shift] {
shift += shifts[pos-shift];
}
shifts[pos+1] = shift;
}
for ch in my_string_vec {
let pos = 0; // simulate some runtime index
if my_other_string_vec[pos] != ch {
...
}
}
I think it's possible to use my_string_vec.as_bytes()[pos] and my_string_vec.as_bytes()[pos-shift] in my condition statement, but I feel that this has a bad code smell.
You can use my_string.chars().nth(index) to access a specific character. (Older versions of this answer suggested char_at(index); that method was never stabilized and is not available in current Rust.) If you want to iterate over the characters in a string, you can use the chars() method, which yields an iterator over the characters in the string.
The reason it was specifically not made possible to use indexing syntax is, IIRC, because indexing syntax would give the impression that it was like accessing a character in your typical C-string-like string, where accessing a character at a given index is a constant time operation (i.e. just accessing a single byte in an array). Strings in Rust, on the other hand, are Unicode and a single character may not necessarily consist of just one byte, making a specific character access a linear time operation, so it was decided to make that performance difference explicit and clear.
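The gap between byte positions and character positions can be seen in a short sketch. Both helper names are invented for illustration; the first walks the string character by character, the second indexes the underlying UTF-8 bytes in constant time.

```rust
/// Character lookup: linear time, because characters vary in byte width.
fn nth_char(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}

/// Byte lookup: constant time, but a byte is not necessarily a character.
fn nth_byte(s: &str, index: usize) -> Option<u8> {
    s.as_bytes().get(index).copied()
}
```

For "h\u{e9}llo" (an "h", an e with acute accent, then "llo"), the accented letter occupies two bytes, so position 2 is the letter 'l' when counting characters but the second byte of the accented letter when counting bytes. For pure ASCII input the two views coincide, which is why working on as_bytes() is a reasonable choice when the data is known to be ASCII.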
As far as I know, there is no method available for swapping characters in a string (see this question). Note that this wouldn't have been possible anyways via an immutably borrowed string, since such a string isn't yours to modify. You would have to most likely use a String, or perhaps a &mut str if you're strictly swapping, but I'm not too familiar with Unicode's intricacies.
I recommend instead you build up a String the way you want it, that way you don't have to worry about the mutability of the borrowed string. You'd refer/look into the borrowed string, and write into the output/build-up string accordingly based on your logic.
So this:
for pos in 0..my_string_vec.len() {
while shift <= pos && my_string_vec[pos] != my_string_vec[pos-shift] {
shift += shifts[pos-shift];
}
shifts[pos+1] = shift;
}
Might become something like this (not tested; not clear what your logic is for):
for (pos, ch) in my_string.chars().enumerate() {
    // chars().nth() looks up a character by position in linear time
    while shift <= pos && Some(ch) != my_string.chars().nth(pos - shift) {
        // assuming shifts is a vec; not clear in question
        shift += shifts[pos - shift];
    }
    shifts.push(shift);
}
Your last for loop:
for ch in my_string_vec {
let pos = 0; // simulate some runtime index
if my_other_string_vec[pos] != ch {
...
}
}
That kind of seems like you want to compare a given character in string A with the corresponding character (in the same position) of string B. For this I would recommend zipping the chars iterator of the first with the second, something like:
for (left, right) in my_string.chars().zip(my_other_string.chars()) {
if left != right {
}
}
Note that zip() stops iterating as soon as either iterator stops, meaning that if the strings are not the same length, then it'll only go as far as the shortest string.
If you need access to the "character index" information, you could add .enumerate() to that, so the above would change to:
for (index, (left, right)) in my_string.chars().zip(my_other_string.chars()).enumerate()
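As a complete, self-contained variant of that loop, the sketch below returns the character index of the first position where two strings differ. The function name first_mismatch is made up for this example.

```rust
/// Returns the character index of the first difference between `a` and `b`,
/// or None if they agree over the length of the shorter string.
fn first_mismatch(a: &str, b: &str) -> Option<usize> {
    for (index, (left, right)) in a.chars().zip(b.chars()).enumerate() {
        if left != right {
            return Some(index);
        }
    }
    // zip stopped at the end of the shorter string without finding a difference.
    None
}
```

Note that a None result does not mean the strings are equal: because zip stops at the shorter input, "abc" and "abcdef" produce None as well, so a separate length check is needed if that distinction matters.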

Resources