Rust reconstitute format=flowed emails, or an iterator that combines some elements of the previous iterator - rust

Currently I have a program that reads some emails from disk and parses some included text (which is csv-like, although it happens to use fixed-width fields separated by '|').
The emails are not particularly huge, so I fs::read_to_string them into a string (in a loop), and for each one use .split("\n") to iterate over lines, then run a constructor on each line to create a struct for each valid csv-like line.
So like
let mut hostiter = text.split("\n")
.filter_map(|x| HostInfo::from_str(x));
Where HostInfo has owned values, copying from the &str references.
This all works fine as is, but now I want to be able to handle emails that quote the records I'm looking for (i.e. lines that start with "> > "). That's easy enough:
let quotes = &['>', ' '];
let mut hostiter = text.split("\n")
.map(|x| x.trim_start_matches(quotes))
.filter_map(|x| HostInfo::from_str(x));
I also need to cope with RFC 3676 (format=flowed) emails. This means that, when forwarded or replied to, email clients split the lines so that each record I'm looking for is spread over 2 or more lines. Continuation lines are delineated with " \r\n", i.e. there is a space before the CR/LF. Non-continuation lines have the "\r\n" after a non-space character. (Currently my code skips these partial records.) I need an iterator that iterates over complete lines. I'm thinking of two ways of doing this:
1. The easiest may be to split the string (on '\n'), trim the quoting from the start of each line, then collect the lines into a new string with '\n' separators. Then a second pass replaces every " \r\n" with ' ', again producing a new string. Now I have a string that can be split on '\n' and has complete records.
2. Otherwise, is there an iterator adapter I can use that will combine elements if they are continuation lines? E.g. can I use group_by to group lines with their continuation lines?
I realize I can't have an iterator that returns complete records as a single &str (unless I do 1.), since the records are split in the original string. However, I can refactor my constructor to take a vector of &str instead of a single &str.

In the end I used coalesce (from the itertools crate) to group the lines. Since the items I'm iterating over are &str, which can't be joined without allocation, I decided to store the output as Vec<&str>. Since coalesce wants the same types as input and output (why?), I needed to convert each &str to a single-item vector before using it. The resulting code was:
let mut hostiter = text.split("\r\n")
    .map(|x| vec![x.trim_start_matches(quotes)])
    .coalesce(|mut x, mut y| match o.flowed && x[x.len()-1].ends_with(' ') {
        true => { x.append(&mut y); Ok(x) },
        false => Err((x, y)),
    })
    .filter_map(|x| HostInfo::from_vec_str(x));
(o.flowed is a flag indicating whether we picked up a Content-Type: header with format=flowed in the headers of the email.)
I had to convert my HostInfo::from_str function to HostInfo::from_vec_str to take a Vec<&str> instead of a &str. Since my from_str function splits the &str on spaces anyway, it was easy enough to use flat_map to split each &str in the Vec and output words...
Not sure if coalesce is the best way to do this. I was looking for an iterator adaptor that would take a closure that takes a collection and an item, and returns a bool; I.e. does this item belong with the other items in this collection? The iterator adaptor output would iterate over collections of items.
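For comparison, the same grouping can be sketched with only the standard library, at the cost of collecting the groups eagerly instead of yielding them lazily. This is a minimal sketch under assumptions: group_flowed_lines is a hypothetical helper name, the flowed parameter stands in for the o.flowed flag above, and it only joins soft-wrapped lines. It does not implement the rest of RFC 3676 (such as space-stuffing or tracking quote depth).

```rust
/// Groups the lines of `text` so that each group holds one logical record.
/// A line that ends with a space (a format=flowed soft break) is grouped
/// with the line that follows it. Hypothetical helper, not part of itertools.
fn group_flowed_lines<'a>(text: &'a str, quotes: &[char], flowed: bool) -> Vec<Vec<&'a str>> {
    let mut records: Vec<Vec<&'a str>> = Vec::new();
    // True when the previous line ended with a soft break.
    let mut continuing = false;
    for raw in text.split("\r\n") {
        let line = raw.trim_start_matches(quotes);
        if continuing {
            // `continuing` is only set after a push, so a last group exists.
            if let Some(last) = records.last_mut() {
                last.push(line);
            }
        } else {
            records.push(vec![line]);
        }
        continuing = flowed && line.ends_with(' ');
    }
    records
}
```

Each inner Vec<&str> could then be handed to something like HostInfo::from_vec_str, as in the coalesce version; with flowed set to false every line ends up in its own single-element group.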

Related

Problem splitting regular ASCII symbols in a string

Just had this error pop up while messing around with some graphics for a terminal interface...
thread 'main' panicked at 'byte index 2 is not a char boundary; it is inside '░' (bytes 1..4) of ░▒▓█', src/main.rs:38:6
Can I not use these characters, or do I need to work some magic to support what I thought were default ASCII characters?
(Here's the related code for those wondering.)
// Example call with the same parameters that led to this issue.
charlist(" ░▒▓█".to_string(), 0.66);
// Returns the n-th character in a string.
// (Where N is a float value from 0 to 1,
// 0 being the start of the string and 1 the end.)
fn charlist<'a>(chars: &'a String, amount: f64) -> &'a str {
let chl: f64 = chars.chars().count() as f64; // Length of the string
let chpos = -((amount*chl)%chl) as i32; // Scalar converted to integer position in the string
&chars[chpos as usize..chpos as usize+1] // Slice the single requested character
}
There are a couple of misconceptions you seem to have, so let me address them in order.
░, ▒, ▓ and █ are not ASCII characters! They are Unicode code points outside the ASCII range. You can determine this with the following simple experiment.
fn main() {
    let slice = " ░▒▓█";
    for c in slice.chars() {
        println!("{}, {}", c, c.len_utf8());
    }
}
This code has output:
, 1
░, 3
▒, 3
▓, 3
█, 3
As you can see, these "boxy" characters have a length of 3 bytes each! Rust uses UTF-8 encoding for all of its strings. This leads to another misconception.
In this line, &chars[chpos as usize..chpos as usize+1], you are trying to get a slice one byte in length. String slices in Rust are indexed by bytes, but you tried to slice in the middle of a character (which is 3 bytes long). In general, characters in UTF-8 encoding can be from one to four bytes in length. To get a char's length in bytes you can use the method len_utf8.
You can get an iterator over the characters in a string slice using the method chars. Then getting the n-th character is as easy as using the iterator's method nth (which counts from zero). So the following is true:
assert_eq!(" ░▒▓█".chars().nth(2).unwrap(), '▒');
If you also want the byte indices of these chars, you can use the method char_indices.
Using f64 values to represent the n-th character is odd, and I would encourage you to rethink whether you really want to do this. But if you do, read on.
You must remember that since characters have a variable length, the string slice's method len doesn't return the number of characters, but the slice's length in bytes. To know how many characters are in the string you have no other option than iterating over it. So if, for example, you want the middle character, you must first know how many there are. I can think of two ways you can do this.
You can collect the characters into a Vec<char> (or something similar). Then you will know how many characters there are and can index the n-th one in O(1). However, this will result in an additional memory allocation.
You can first count how many characters there are with slice.chars().count(). Then calculate the position of the n-th one and get it by iterating over chars again and taking the n-th one (as I showed above). This won't result in any additional memory allocation, but it iterates over the string up to twice (still O(n) overall).
Which one you pick is up to you. You will have to make a compromise.
This isn't really a correctness problem, but prefer using &str over &String in the arguments of functions (as it will provide more flexibility to your callers). And you don't have to specify lifetime if you have only one reference in the arguments and the other one is in the returned type. Rust will infer that they have to have the same lifetime.
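Putting these points together, the second option (count, then nth) might look like the sketch below. It is one possible rewrite, not the only one, and it makes two assumptions that differ from the question's code: it returns an Option<char> rather than a &str, and it clamps amount to the last character instead of wrapping around with a modulo.

```rust
/// Returns the character at relative position `amount` (0.0 = first,
/// 1.0 = last) of `chars`, or `None` if the string is empty.
fn charlist(chars: &str, amount: f64) -> Option<char> {
    // First pass: number of chars, not bytes.
    let count = chars.chars().count();
    if count == 0 {
        return None;
    }
    // Scale to a char index. The float-to-integer cast saturates, and `min`
    // keeps 1.0 (or anything above it) on the last character.
    let pos = ((amount * count as f64) as usize).min(count - 1);
    // Second pass: walk to the requested char.
    chars.chars().nth(pos)
}
```

With the call from the question, charlist(" ░▒▓█", 0.66) scales to index 3 of the 5 characters and yields Some('▓').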

Need to extract the last word in a Rust string

I am doing some processing of a string in Rust, and I need to be able to extract the last set of characters from that string. In other words, given a string like the following:
some|not|necessarily|long|name
I need to be able to get the last part of that string, namely "name" and put it into another String or a &str, in a manner like:
let last = call_some_function("some|not|necessarily|long|name");
so that last becomes equal to "name".
Is there a way to do this? Is there a string function that will allow this to be done easily? If not (after looking at the documentation, I doubt that there is), how would one do this in Rust?
While the answer from @effect is correct, it is not the most idiomatic nor the most performant way to do it. It'll walk the entire string and match all of the |s to reach the last one. You can make it better, but there is a method of str that does exactly what you want: rsplit_once().
let (_, name) = s.rsplit_once('|').unwrap();
// Or
// let name = s.rsplit_once('|').unwrap().1;
//
// You can also use a multichar separator:
// let (_, name) = s.rsplit_once("|").unwrap();
// But in the case of a single character, a `char` type is likely to be more performant.
You can use the split() method (defined on str and therefore also available on String), which returns an iterator over the substrings separated by the given separator, and then use the Iterator::last() method to return the last element of the iterator, like so:
let s = String::from("some|not|necessarily|long|name");
let last = s.split('|').last().unwrap();
assert_eq!(last, "name");
Please also note that split works directly on string slices (&str), so you don't need to create a String first.
let s = "some|not|necessarily|long|name";
let last = s.split('|').last().unwrap();
assert_eq!(last, "name");
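One caveat with the rsplit_once version is that it returns None when the separator does not occur at all, so the unwrap would panic on such input. A minimal sketch that avoids this, assuming you want the whole string back in that case (last_segment is a hypothetical helper name):

```rust
/// Returns the text after the last `sep`, or the whole string if `sep`
/// does not occur in `s`.
fn last_segment(s: &str, sep: char) -> &str {
    // rsplit yields pieces starting from the right, so the first item it
    // produces is the last segment; no need to walk the whole string.
    s.rsplit(sep).next().unwrap_or(s)
}
```

The result borrows from the input, so no new String is allocated; call .to_string() on it if an owned value is needed.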

Rust split vector of bytes by specific bytes

I have a file containing information that I want to load into the application. The file has some header information as strings, then multiple entries that are each terminated by ';'. Some entries are used for different types and therefore their length is variable, but all values are separated by ','.
Example:
\Some heading
\Some other heading
I,003f,3f3d00ed,"Some string",00ef,
0032,20f3
;
Y,02d1,0000,0000,"Name of element",
00000007,0,
00000000,0,
;
Y,02d1,0000,0000,"Name of element",30f0,2d0f,02sd,
00000007,0,
00000000,0,
;
I is one type of element
Y is another type of element
What I want to achieve is, to bring the elements into different structs to work with. Most of the values are numbers but some are strings.
What I was able to achieve is:
Import the file as Vec<u8>
Put it in a string (can't do that directly, because there may be UTF-8 problems in elements I'm not interested in)
Split it to a Vec<&str> by ';'
Pass the strings to functions depending on their type
Split it to a Vec by '\n'
Split it to a Vec by ','
Reading out the data I need and interpret from the strings (str::from_str_radix for example)
Build the struct and return it
This seems not to be the way to go, since I start with bytes, allocate them as string and then again allocate numbers on most of the values.
So my question is:
Can I split the Vec<u8> into multiple vectors separated by ';' (byte 59), split these further by '\n' and split this further by ','.
I assume it would be more performant to apply the bytes directly to the correct data-type. Or is my concern wrong?
Can I split the Vec into multiple vectors separated by ';' (byte 59), split these further by '\n' and split this further by ','.
Usually that is not going to work if the other bytes may appear in other places, like embedded in the strings.
Then there is also the question of how the strings are encoded, whether there are escape sequences, etc.
I assume it would be more performant to apply the bytes directly to the correct data-type. Or is my concern wrong?
Reading the entire file into memory and then performing several copies from one Vec to another Vec and another and so on is going to be slower than a single pass with a state machine of some kind. Not to mention it will make working with files bigger than memory extremely slow or impossible.
I wouldn't worry about performance until you have a working algorithm, in particular if you have to work with an undocumented, non-trivial, third-party format and you are not experienced at reading binary formats.
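That said, if the separator bytes are known never to occur inside the quoted strings (an assumption that holds for the sample above but may not hold for the real format), the standard library's slice::split can do the nested splitting directly on the bytes, borrowing from the original buffer instead of copying. A rough sketch, where split_records and hex_field are hypothetical helper names:

```rust
/// Splits raw bytes into records terminated by ';', and each record into
/// its non-empty fields separated by ',' or line breaks.
/// Assumes none of these bytes appear inside quoted strings.
fn split_records(data: &[u8]) -> Vec<Vec<&[u8]>> {
    data.split(|&b| b == b';')
        .map(|record| {
            record
                .split(|&b| b == b',' || b == b'\n' || b == b'\r')
                .filter(|field| !field.is_empty())
                .collect::<Vec<_>>()
        })
        // Drop the empty "record" left after the final ';'.
        .filter(|fields| !fields.is_empty())
        .collect()
}

/// Interprets one field as a hexadecimal number, returning None if the
/// field is not valid UTF-8 or not valid hex.
fn hex_field(field: &[u8]) -> Option<u32> {
    let text = std::str::from_utf8(field).ok()?;
    u32::from_str_radix(text.trim(), 16).ok()
}
```

Only the fields that are actually interpreted go through UTF-8 validation, so invalid bytes in fields you are not interested in no longer matter. The header lines would still need to be skipped before the first record.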
An answer not exactly for this question - but i got directed here and didn't get a better answer.
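The function below takes a byte slice and a separator byte slice, and splits the first by the second, returning the pieces in source order.
fn split_bytes<'a>(bs: &'a [u8], pred: &[u8]) -> Vec<&'a [u8]> {
    // An empty separator cannot split anything (and windows(0) would panic).
    if pred.is_empty() {
        return vec![bs];
    }
    // Byte ranges of every non-overlapping occurrence of the separator.
    let mut indexes: Vec<(usize, usize)> = Vec::new();
    let mut next_free = 0;
    for (index, el) in bs.windows(pred.len()).enumerate() {
        if index >= next_free && el == pred {
            next_free = index + pred.len();
            indexes.push((index, next_free));
        }
    }
    // Walk the matches from the back, peeling one piece off the end each time.
    let mut cur = bs;
    let mut res: Vec<&[u8]> = Vec::new();
    for &(start, end) in indexes.iter().rev() {
        res.push(&cur[end..]);
        cur = &cur[..start];
    }
    res.push(cur); // the piece before the first separator
    res.reverse(); // restore source order
    res
}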
Here is a demo link in the Rust playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=ff653e2be80d4f73542e26dc37c46f13

Slice a string containing Unicode chars

I have a piece of text with characters of different bytelength.
let text = "Hello привет";
I need to take a slice of the string given start (included) and end (excluded) character indices. I tried this
let slice = &text[start..end];
and got the following error
thread 'main' panicked at 'byte index 7 is not a char boundary; it is inside 'п' (bytes 6..8) of `Hello привет`'
I suppose it happens since Cyrillic letters are multi-byte and the [..] notation takes chars using byte indices. What can I use if I want to slice using character indices, like I do in Python:
slice = text[start:end] ?
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
Possible solutions to codepoint slicing
I know I can use the chars() iterator and manually walk through the desired substring, but is there a more concise way?
If you know the exact byte indices, you can slice a string:
let text = "Hello привет";
println!("{}", &text[2..10]);
This prints "llo пр". So the problem is to find out the exact byte position. You can do that fairly easily with the char_indices() iterator (alternatively you could use chars() with char::len_utf8()):
let text = "Hello привет";
let end = text.char_indices().map(|(i, _)| i).nth(8).unwrap();
println!("{}", &text[2..end]);
As another alternative, you can first collect the string into Vec<char>. Then, indexing is simple, but to print it as a string, you have to collect it again or write your own function to do it.
let text = "Hello привет";
let text_vec = text.chars().collect::<Vec<_>>();
println!("{}", text_vec[2..8].iter().cloned().collect::<String>());
Why is this not easier?
As you can see, neither of these solutions is all that great. This is intentional, for two reasons:
As str is simply a UTF-8 buffer, indexing by Unicode codepoints is an O(n) operation. Usually, people expect the [] operator to be an O(1) operation. Rust makes this runtime complexity explicit and doesn't try to hide it. In both solutions above you can clearly see that it's not O(1).
But the more important reason:
Unicode codepoints are generally not a useful unit
What Python does (and what you think you want) is not all that useful. It all comes down to the complexity of language and thus the complexity of unicode. Python slices Unicode codepoints. This is what a Rust char represents. It's 32 bit big (a few fewer bits would suffice, but we round up to a power of 2).
But what you actually want to do is slice user perceived characters. But this is an explicitly loosely defined term. Different cultures and languages regard different things as "one character". The closest approximation is a "grapheme cluster". Such a cluster can consist of one or more unicode codepoints. Consider this Python 3 code:
>>> s = "Jürgen"
>>> s[0:2]
'Ju'
Surprising, right? This is because the string above is:
0x004A LATIN CAPITAL LETTER J
0x0075 LATIN SMALL LETTER U
0x0308 COMBINING DIAERESIS
...
This is an example of a combining character that is rendered as part of the previous character. Python slicing does the "wrong" thing here.
Another example:
>>> s = "ﬁre"
>>> s[0:2]
'ﬁr'
Also not what you'd expect. This time, ﬁ is actually the ligature "ﬁ" (U+FB01, LATIN SMALL LIGATURE FI), which is one codepoint.
There are far more examples where Unicode behaves in a surprising way. See the links at the bottom for more information and examples.
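The same effect is easy to reproduce in Rust, because a char is one code point and chars() iterates code point by code point. In this small sketch first_n_chars is a hypothetical helper, and the strings are written with explicit escapes so that the decomposed and precomposed forms cannot be confused:

```rust
/// Returns the first `n` code points of `s` as a new String.
/// This is code point slicing, not grapheme cluster slicing.
fn first_n_chars(s: &str, n: usize) -> String {
    s.chars().take(n).collect()
}
```

Called with "Ju\u{0308}rgen" (u followed by a combining diaeresis) and n = 2, it returns "Ju" and drops the diaeresis, while the precomposed "J\u{00FC}rgen" gives "Jü"; the two inputs render identically but slice differently.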
So if you want to work with international strings that should be able to work everywhere, don't do codepoint slicing! If you really need to semantically view the string as a series of characters, use grapheme clusters. To do that, the crate unicode-segmentation is very useful.
Further resources on this topic:
Blogpost "Let's stop ascribing meaning to unicode codepoints"
Blogpost "Breaking our Latin-1 assumptions"
http://utf8everywhere.org/
A UTF-8 encoded string may contain characters which consist of multiple bytes. In your case, п starts at index 6 (inclusive) and ends at position 8 (exclusive), so index 7 is not the start of a character. This is why your error occurred.
You may use str::char_indices() for solving this (remember, that getting to a position in a UTF-8 string is O(n)):
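fn get_utf8_slice(string: &str, start: usize, end: usize) -> Option<&str> {
    assert!(end >= start);
    string.char_indices().nth(start).map(|(start_pos, _)| {
        let rest = &string[start_pos..];
        // Byte offset, within `rest`, of the first char past the range;
        // an `end` beyond the last char clamps to the end of the string.
        let len = rest
            .char_indices()
            .nth(end - start)
            .map_or(rest.len(), |(pos, _)| pos);
        &rest[..len]
    })
}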
You may use str::chars() if you are fine with getting a String:
let string: String = text.chars().take(end).skip(start).collect();
Here is a function which retrieves a utf8 slice, with the following pros:
handle all edge cases (empty input, 0-width output ranges, out-of-scope ranges);
never panics;
use start-inclusive, end-exclusive ranges.
pub fn utf8_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    let mut iter = s.char_indices()
        .map(|(pos, _)| pos)
        .chain(Some(s.len()))
        .skip(start)
        .peekable();
    let start_pos = *iter.peek()?;
    for _ in start..end { iter.next(); }
    Some(&s[start_pos..*iter.peek()?])
}

[] operator for strings, link with slices for vectors

Why do you have to walk over the string to find the nᵗʰ letter of a string when you do s[n] where s is a string. (According to https://doc.rust-lang.org/book/strings.html)
From what I understood, a string is an array of chars and a char is an array of 4 bytes or a 4-byte number. So would getting the nth letter be similar to doing v[4*n..4*n+4], where v is a vector?
What is the cost of v[i..j] ?
I would assume that the cost of v[i..j] is j-i and so that the cost of s[n] should be 4.
Note: The second edition of The Rust Programming Language has an improved and smooth explanation to Strings in Rust, which you might wish to read as well. The answer below, although still accurate, quotes from the first edition of the book.
I will try to clarify these misconceptions about strings in Rust by quoting from the book (https://doc.rust-lang.org/book/strings.html).
A ‘string’ is a sequence of Unicode scalar values encoded as a stream of UTF-8 bytes. All strings are guaranteed to be a valid encoding of UTF-8 sequences.
With this in mind, plus the fact that UTF-8 encodes code points with a variable size (1 to 4 bytes depending on the character), all strings in Rust, whether they are &str or String, are not arrays of characters, and cannot be treated as such. The book goes on to explain why:
Because strings are valid UTF-8, they do not support indexing:
let s = "hello";
println!("The first letter of s is {}", s[0]); // ERROR!!!
Usually, access to a vector with [] is very fast. But, because each character in a UTF-8 encoded string can be multiple bytes, you have to walk over the string to find the nᵗʰ letter of a string. This is a significantly more expensive operation, and we don’t want to be misleading.
Unlike what was mentioned in the question, one cannot do s[n], because although in theory this would allow us to fetch the nth byte in constant time, that byte is not guaranteed to make any sense on its own.
What is the cost of v[i..j] ?
The cost of slicing is actually constant, because it is done at byte-level:
You can get a slice of a string with slicing syntax:
let dog = "hachiko";
let hachi = &dog[0..5];
But note that these are byte offsets, not character offsets. So this will fail at runtime:
let dog = "忠犬ハチ公";
let hachi = &dog[0..2];
with this error:
thread '<main>' panicked at 'index 0 and/or 2 in 忠犬ハチ公 do not lie on
character boundary'
Basically, slicing is acceptable and will yield a new view of that string, so no copies are made. However, it should only be used when you are completely sure that the offsets are right in terms of character boundaries.
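When the offsets come from somewhere you cannot fully trust, the standard library also offers a non-panicking form of the same constant-time slicing: str::get returns None instead of panicking, and str::is_char_boundary checks a single offset. A small sketch, where try_slice is a hypothetical wrapper name:

```rust
/// Returns bytes `start..end` of `s` as a string slice, or `None` if the
/// range is out of bounds or does not fall on character boundaries.
fn try_slice(s: &str, start: usize, end: usize) -> Option<&str> {
    // Same cost as `&s[start..end]`: no copy and no walk over the string.
    s.get(start..end)
}
```

For the book's examples, try_slice("hachiko", 0, 5) gives Some("hachi"), while try_slice("忠犬ハチ公", 0, 2) gives None because byte 2 lies inside the first three-byte character.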
In order to iterate over each character of a string, you may instead call chars():
let c = s.chars().nth(n);
Even with that in mind, note that handling Unicode character might not be exactly what you want if you wish to handle character modifiers in UTF-8 (which are scalar values by themselves but should not be treated individually either). Quoting now from the str API:
fn chars(&self) -> Chars
Returns an iterator over the chars of a string slice.
As a string slice consists of valid UTF-8, we can iterate through a string slice by char. This method returns such an iterator.
It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want.
Remember, chars may not match your human intuition about characters:
let y = "y̆";
let mut chars = y.chars();
assert_eq!(Some('y'), chars.next()); // not 'y̆'
assert_eq!(Some('\u{0306}'), chars.next());
assert_eq!(None, chars.next());
The unicode_segmentation crate provides a means to define grapheme cluster boundaries:
extern crate unicode_segmentation;
use unicode_segmentation::UnicodeSegmentation;
let s = "a̐éö̲\r\n";
let g = UnicodeSegmentation::graphemes(s, true).collect::<Vec<&str>>();
let b: &[_] = &["a̐", "é", "ö̲", "\r\n"];
assert_eq!(g, b);
If you do want to treat the string as an array of codepoints (which isn't strictly the same as characters; there are combining marks, emoji with separate skin-tone modifiers, etc.), you can collect it into a Vec:
fn main() {
    let s = "£10 🙃!";
    for (i, c) in s.char_indices() {
        println!("{} {}", i, c);
    }
    let v: Vec<char> = s.chars().collect();
    println!("v[5] = {}", v[5]);
}
With bonus demonstration of some varying character widths, this outputs:
0 £
2 1
3 0
4
5 🙃
9 !
v[5] = !
