Splitting a UTF-8 string into chunks

I want to split a UTF-8 string into chunks of equal size. I came up with a solution that does exactly that. Now I want to simplify it by removing the first collect call, if possible. Is there a way to do it?
fn main() {
    let strings = "ĄĆĘŁŃÓŚĆŹŻ"
        .chars()
        .collect::<Vec<char>>()
        .chunks(3)
        .map(|chunk| chunk.iter().collect::<String>())
        .collect::<Vec<String>>();
    println!("{:?}", strings);
}

You can use chunks() from Itertools.
use itertools::Itertools; // 0.10.1
fn main() {
    let strings = "ĄĆĘŁŃÓŚĆŹŻ"
        .chars()
        .chunks(3)
        .into_iter()
        .map(|chunk| chunk.collect::<String>())
        .collect::<Vec<String>>();
    println!("{:?}", strings);
}

This doesn't require Itertools as a dependency and also does not allocate, as it iterates over slices of the original string:
fn chunks(s: &str, length: usize) -> impl Iterator<Item = &str> {
    assert!(length > 0);
    let mut indices = s.char_indices().map(|(idx, _)| idx).peekable();
    std::iter::from_fn(move || {
        let start_idx = match indices.next() {
            Some(idx) => idx,
            None => return None,
        };
        for _ in 0..length - 1 {
            indices.next();
        }
        let end_idx = match indices.peek() {
            Some(idx) => *idx,
            None => s.len(),
        };
        Some(&s[start_idx..end_idx])
    })
}
fn main() {
    let strings = chunks("ĄĆĘŁŃÓŚĆŹŻ", 3).collect::<Vec<&str>>();
    println!("{:?}", strings);
}
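
If owned Strings are acceptable, a plain loop over a single chars() iterator also avoids the intermediate Vec<char> without any extra crate. The following is only a minimal sketch: the name chunk_chars is hypothetical and does not come from the answers above. It relies on by_ref().take(n), which pulls at most n characters from the shared iterator on each pass.

```rust
// Hypothetical helper (not from the answers above): splits `s` into owned
// chunks of at most `size` characters each.
fn chunk_chars(s: &str, size: usize) -> Vec<String> {
    assert!(size > 0, "chunk size must be positive");
    let mut chars = s.chars();
    let mut out = Vec::new();
    loop {
        // `by_ref()` borrows the iterator, so `take` advances `chars`
        // itself instead of consuming it.
        let chunk: String = chars.by_ref().take(size).collect();
        if chunk.is_empty() {
            break;
        }
        out.push(chunk);
    }
    out
}

fn main() {
    println!("{:?}", chunk_chars("ĄĆĘŁŃÓŚĆŹŻ", 3));
}
```

Unlike the slice-based version, this allocates one String per chunk, so it trades allocation for simplicity.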

Having considered the problem with graphemes I ended up with the following solution.
I used the unicode-segmentation crate.
use unicode_segmentation::UnicodeSegmentation;
fn main() {
    let length = 3; // chunk size, counted in grapheme clusters
    let strings = "ĄĆĘŁŃÓŚĆŹŻèèèèè"
        .graphemes(true)
        .collect::<Vec<&str>>()
        .chunks(length)
        .map(|chunk| chunk.concat())
        .collect::<Vec<String>>();
    println!("{:?}", strings);
}
I hope some simplifications can still be made.

Related

Convert &[Box<[u8]>] to String or &str

What's an efficient way to convert a result of type &[Box<[u8]>] into something more readily consumed like String or &str?
An example function is the txt_data() method from trust_dns_proto::rr::rdata::txt::TXT.
I've tried several things that seem to go nowhere, like:
fn main() {
let raw: &[Box<[u8]>] = &["Hello", " world!"]
.iter()
.map(|i| i.as_bytes().to_vec().into_boxed_slice())
.collect::<Vec<_>>();
let value = raw.iter().map(|s| String::from(*s)).join("");
assert_eq!(value, "Hello world!");
}
Where raw is of that type.

There is no way to convert this to a &str directly, because the data is split across several separate allocations. So a String looks like a good candidate.
I would use str::from_utf8() combined with try_fold():
use std::str;
fn main() {
let raw: &[Box<[u8]>] = &["Hello", " world!"]
.iter()
.map(|i| i.as_bytes().to_vec().into_boxed_slice())
.collect::<Vec<_>>();
let value = raw
.iter()
.map(|i| str::from_utf8(i))
.try_fold(String::new(), |a, i| {
i.map(|i| {
let mut a = a;
a.push_str(i);
a
})
});
assert_eq!(value.as_ref().map(|x| x.as_str()), Ok("Hello world!"));
}

It looks like the solution is this:
let value: String = raw
.iter()
.map(|s| String::from_utf8((*s).to_vec()).unwrap())
.collect::<Vec<String>>()
.join("");
Where the key is from_utf8() and the (*s).to_vec() suggested by rustc.
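
Another option is to copy all the bytes into one buffer first and validate the result once. The sketch below is an illustration under that assumption, and the function name join_utf8 is hypothetical. Compared with validating each piece separately, it returns the error instead of panicking, and it also accepts a multi-byte character whose bytes happen to be split across two boxes.

```rust
use std::string::FromUtf8Error;

// Hypothetical helper: concatenates all byte chunks into one buffer and
// validates the whole buffer as UTF-8 in a single pass.
fn join_utf8(parts: &[Box<[u8]>]) -> Result<String, FromUtf8Error> {
    // Reserve the exact total size up front to avoid reallocations.
    let total: usize = parts.iter().map(|p| p.len()).sum();
    let mut bytes = Vec::with_capacity(total);
    for part in parts {
        bytes.extend_from_slice(part);
    }
    String::from_utf8(bytes)
}

fn main() {
    let raw: Vec<Box<[u8]>> = ["Hello", " world!"]
        .iter()
        .map(|s| s.as_bytes().to_vec().into_boxed_slice())
        .collect();
    println!("{:?}", join_utf8(&raw));
}
```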

Does Rust have an equivalent to Python's dictionary comprehension syntax?

How would one translate the following Python, in which several files are read and their contents are used as values to a dictionary (with filename as key), to Rust?
countries = {region: open("{}.txt".format(region)).read() for region in ["canada", "usa", "mexico"]}
My attempt is shown below, but I was wondering if a one-line, idiomatic solution is possible.
use std::{
fs::File,
io::{prelude::*, BufReader},
path::Path,
collections::HashMap,
};
macro_rules! map(
{ $($key:expr => $value:expr),+ } => {
{
let mut m = HashMap::new();
$(
m.insert($key, $value);
)+
m
}
};
);
fn lines_from_file<P>(filename: P) -> Vec<String>
where
P: AsRef<Path>,
{
let file = File::open(filename).expect("no such file");
let buf = BufReader::new(file);
buf.lines()
.map(|l| l.expect("Could not parse line"))
.collect()
}
fn main() {
let _countries = map!{ "canada" => lines_from_file("canada.txt"),
"usa" => lines_from_file("usa.txt"),
"mexico" => lines_from_file("mexico.txt") };
}

Rust's iterators have map/filter/collect methods which are enough to do anything Python's comprehensions can. You can create a HashMap with collect on an iterator of pairs, but collect can return various types of collections, so you may have to specify the type you want.
For example,
use std::collections::HashMap;
fn main() {
println!(
"{:?}",
(1..5).map(|i| (i + i, i * i)).collect::<HashMap<_, _>>()
);
}
Is roughly equivalent to the Python
print({i+i: i*i for i in range(1, 5)})
But translated very literally, it's actually closer to
from builtins import dict
def main():
print("{!r}".format(dict(map(lambda i: (i+i, i*i), range(1, 5)))))
if __name__ == "__main__":
main()
not that you would ever say it that way in Python.

Python's comprehensions are just sugar for a for loop and accumulator. Rust has macros--you can make any sugar you want.
Take this simple Python example,
print({i+i: i*i for i in range(1, 5)})
You could easily re-write this as a loop and accumulator:
map = {}
for i in range(1, 5):
map[i+i] = i*i
print(map)
You could do it basically the same way in Rust.
use std::collections::HashMap;
fn main() {
let mut hm = HashMap::new();
for i in 1..5 {
hm.insert(i + i, i * i);
}
println!("{:?}", hm);
}
You can use a macro to do the rewriting to this form for you.
use std::collections::HashMap;
macro_rules! hashcomp {
($name:ident = $k:expr => $v:expr; for $i:ident in $itr:expr) => {
let mut $name = HashMap::new();
for $i in $itr {
$name.insert($k, $v);
}
};
}
When you use it, the resulting code is much more compact. And this choice of separator tokens makes it resemble the Python.
fn main() {
hashcomp!(hm = i+i => i*i; for i in 1..5);
println!("{:?}", hm);
}
This is just a basic example that can handle a single loop. Python's comprehensions also can have filters and additional loops, but a more advanced macro could probably do that too.
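
For comparison, filters and additional loops can also be expressed without a macro, using filter and flat_map on the iterator chain. This is a small sketch with made-up function names (even_squares and pair_products are hypothetical); each comment shows the Python comprehension it is meant to mirror.

```rust
use std::collections::HashMap;

// Mirrors: {i + i: i * i for i in range(1, 10) if i % 2 == 0}
fn even_squares() -> HashMap<i32, i32> {
    (1..10)
        .filter(|i| i % 2 == 0)
        .map(|i| (i + i, i * i))
        .collect()
}

// Mirrors: {(a, b): a * b for a in range(1, 3) for b in range(1, 3)}
fn pair_products() -> HashMap<(i32, i32), i32> {
    (1..3)
        // `move` copies `a` into the inner closure so it can outlive this call.
        .flat_map(|a| (1..3).map(move |b| ((a, b), a * b)))
        .collect()
}

fn main() {
    println!("{:?}", even_squares());
    println!("{:?}", pair_products());
}
```

Because the return types name HashMap explicitly, collect needs no further type annotation here.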

Without using your own macros I think the closest to
countries = {region: open("{}.txt".format(region)).read() for region in ["canada", "usa", "mexico"]}
in Rust would be
let countries: HashMap<_, _> = ["canada", "usa", "mexico"].iter().map(|&c| {(c,read_to_string(c.to_owned() + ".txt").expect("Error reading file"),)}).collect();
but running a formatter will make it more readable:
let countries: HashMap<_, _> = ["canada", "usa", "mexico"]
.iter()
.map(|&c| {
(
c,
read_to_string(c.to_owned() + ".txt").expect("Error reading file"),
)
})
.collect();
A few notes:
To map over an array or vector, you first need to turn it into an iterator, hence iter().map(...).
To turn an iterator back into a concrete data structure, e.g. a HashMap (the counterpart of a Python dict), use .collect(). This is both the advantage and the pain of Rust: it is very strict about types and performs no unexpected conversions.
A complete test program:
use std::collections::HashMap;
use std::fs::{read_to_string, File};
use std::io::Write;
fn create_files() -> std::io::Result<()> {
let regios = [
("canada", "Ottawa"),
("usa", "Washington"),
("mexico", "Mexico city"),
];
for (country, capital) in regios {
let mut file = File::create(country.to_owned() + ".txt")?;
file.write_fmt(format_args!("The capital of {} is {}", country, capital))?;
}
Ok(())
}
fn create_hashmap() -> HashMap<&'static str, String> {
let countries = ["canada", "usa", "mexico"]
.iter()
.map(|&c| {
(
c,
read_to_string(c.to_owned() + ".txt").expect("Error reading file"),
)
})
.collect();
countries
}
fn main() -> std::io::Result<()> {
println!("Hello, world!");
create_files().expect("Failed to create files");
let countries = create_hashmap();
{
println!("{:#?}", countries);
}
std::io::Result::Ok(())
}
Note that specifying the type of countries is not needed here, because the return type of create_hashmap() is defined.
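
If panicking on a missing file is not acceptable, collect can also gather into a Result, stopping at the first error. The following is a hedged sketch rather than part of the answer above: the helper name read_regions and its dir parameter are assumptions introduced for illustration.

```rust
use std::collections::HashMap;
use std::fs::read_to_string;
use std::io;
use std::path::Path;

// Hypothetical helper: reads `<dir>/<name>.txt` for every name and returns
// either the complete map or the first I/O error encountered.
fn read_regions<'a>(dir: &Path, names: &[&'a str]) -> io::Result<HashMap<&'a str, String>> {
    names
        .iter()
        .map(|&name| {
            // Each item is a Result<(key, value), io::Error>.
            read_to_string(dir.join(format!("{}.txt", name))).map(|text| (name, text))
        })
        // Collecting an iterator of Results into a Result short-circuits on Err.
        .collect()
}

fn main() {
    match read_regions(Path::new("."), &["canada", "usa", "mexico"]) {
        Ok(countries) => println!("{:#?}", countries),
        Err(err) => println!("could not read a region file: {}", err),
    }
}
```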

How do I convert a Peekable iterator back to the original iterator?

I want to implement an algorithm that skips ! or !^num at the start of a string:
fn extract_common_part(a: &str) -> Option<&str> {
let mut it = a.chars();
if it.next() != Some('!') {
return None;
}
let mut jt = it.clone().peekable();
if jt.peek() == Some(&'^') {
it.next();
jt.next();
while jt.peek().map_or(false, |v| !v.is_whitespace()) {
it.next();
jt.next();
}
it.next();
}
Some(it.as_str())
}
fn main() {
assert_eq!(extract_common_part("!^4324 1234"), Some("1234"));
assert_eq!(extract_common_part("!1234"), Some("1234"));
}
This works, but I cannot find a way to get from the Peekable back to the Chars, so I have to advance both the it and jt iterators, which duplicates code.
How can I get from a Peekable iterator back to the corresponding Chars iterator? Or is there a simpler way to implement this algorithm?

In short, you cannot. The general answer is to use something like Iterator::by_ref to avoid consuming the Chars iterator:
fn extract_common_part(a: &str) -> Option<&str> {
let mut it = a.chars();
if it.next() != Some('!') {
return None;
}
{
let mut jt = it.by_ref().peekable();
if jt.peek() == Some(&'^') {
jt.next();
while jt.peek().map_or(false, |v| !v.is_whitespace()) {
jt.next();
}
}
}
Some(it.as_str())
}
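The problem is that peek has to pull the next character out of the underlying iterator in order to look at it, so by the time a test on the peeked value fails, the underlying iterator has already been advanced. Getting the rest of the string then loses the character that tested false, returning 234 instead of 1234 for the input !1234.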
However, Itertools has peeking_take_while and take_while_ref, both of which should solve the issue.
extern crate itertools;
use itertools::Itertools;
fn extract_common_part(a: &str) -> Option<&str> {
let mut it = a.chars();
if it.next() != Some('!') {
return None;
}
if it.peeking_take_while(|&c| c == '^').next() == Some('^') {
for _ in it.peeking_take_while(|v| !v.is_whitespace()) {}
for _ in it.peeking_take_while(|v| v.is_whitespace()) {}
}
Some(it.as_str())
}
Other options include:
Using a crate like strcursor, which is designed for this kind of incremental advance over a string.
Doing the parsing on regular strings directly, and hoping the optimizer eliminates redundant bounds checks.
Using a regex or other parsing library.
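
As an illustration of the "parse the string directly" option, here is a minimal sketch that uses only str methods. It assumes Rust 1.45 or later for str::strip_prefix, and it is meant to mirror the behaviour of the question's code (skip "!", then optionally "^", a run of non-whitespace characters, and one following character), not to be a definitive implementation.

```rust
fn extract_common_part(a: &str) -> Option<&str> {
    // The input must start with '!'.
    let rest = a.strip_prefix('!')?;
    match rest.strip_prefix('^') {
        Some(tagged) => {
            // Skip the run of non-whitespace characters after "!^".
            let end = tagged.find(char::is_whitespace).unwrap_or(tagged.len());
            // Then skip a single character (the whitespace), if there is one.
            let mut chars = tagged[end..].chars();
            chars.next();
            Some(chars.as_str())
        }
        None => Some(rest),
    }
}

fn main() {
    println!("{:?}", extract_common_part("!^4324 1234"));
    println!("{:?}", extract_common_part("!1234"));
}
```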

If you are only interested in the result, without validation:
fn extract_common_part(a: &str) -> Option<&str> {
a.chars().rev().position(|v| v.is_whitespace() || v == '!')
.map(|pos| &a[a.len() - pos..])
}
fn main() {
assert_eq!(extract_common_part("!^4324 1234"), Some("1234"));
assert_eq!(extract_common_part("!1234"), Some("1234"));
}

Implement slice_shift_char using the std library

I'd like to use the &str method slice_shift_char, but it is marked as unstable in the documentation:
Unstable: awaiting conventions about shifting and slices and may not
be warranted with the existence of the chars and/or char_indices
iterators
What would be a good way to implement this method, with Rust's current std library? So far I have:
fn slice_shift_char(s: &str) -> Option<(char, &str)> {
let mut ixs = s.char_indices();
let next = ixs.next();
match next {
Some((next_pos, ch)) => {
let rest = unsafe {
s.slice_unchecked(next_pos, s.len())
};
Some((ch, rest))
},
None => None
}
}
I'd like to avoid the call to slice_unchecked. I'm using Rust 1.1.

Well, you can look at the source code, and you'll get https://github.com/rust-lang/rust/blob/master/src/libcollections/str.rs#L776-L778 and https://github.com/rust-lang/rust/blob/master/src/libcore/str/mod.rs#L1531-L1539. The second is:
fn slice_shift_char(&self) -> Option<(char, &str)> {
if self.is_empty() {
None
} else {
let ch = self.char_at(0);
let next_s = unsafe { self.slice_unchecked(ch.len_utf8(), self.len()) };
Some((ch, next_s))
}
}
If you don't want the unsafe, you can just use a normal slice:
fn slice_shift_char(&self) -> Option<(char, &str)> {
if self.is_empty() {
None
} else {
let ch = self.char_at(0);
let len = self.len();
let next_s = &self[ch.len_utf8().. len];
Some((ch, next_s))
}
}

The unstable slice_shift_char function has been deprecated since Rust 1.9.0 and removed completely in Rust 1.11.0.
As of Rust 1.4.0, the recommended way to implement this is:
1. Use .chars() to get an iterator over the char content.
2. Advance this iterator once to get the first character.
3. Call .as_str() on the iterator to recover the remaining, not yet iterated part of the string.
fn slice_shift_char(a: &str) -> Option<(char, &str)> {
let mut chars = a.chars();
chars.next().map(|c| (c, chars.as_str()))
}
fn main() {
assert_eq!(slice_shift_char("hello"), Some(('h', "ello")));
assert_eq!(slice_shift_char("ĺḿńóṕ"), Some(('ĺ', "ḿńóṕ")));
assert_eq!(slice_shift_char(""), None);
}

Using the same iterator multiple times in Rust

Editor's note: This code example is from a version of Rust prior to 1.0, when many iterators implemented Copy. Updated versions of this code produce different errors, but the answers still contain valuable information.
I'm trying to write a function to split a string into clumps of letters and numbers; for example, "test123test" would turn into [ "test", "123", "test" ]. Here's my attempt so far:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek() {
None => return bits,
Some(c) => if c.is_digit() {
bits.push(iter.take_while(|c| c.is_digit()).collect());
} else {
bits.push(iter.take_while(|c| !c.is_digit()).collect());
}
}
}
return bits;
}
However, this doesn't work: it loops forever. It seems that a clone of iter is used each time I call take_while, so every call starts from the same position again. I would like it to use the same iter each time, advancing a single iterator across all the take_while calls. Is this possible?

As you identified, each take_while call is duplicating iter, since take_while takes self and the Peekable chars iterator is Copy. (Only true before Rust 1.0 — editor)
You want to be modifying the iterator each time, that is, for take_while to be operating on an &mut to your iterator. Which is exactly what the .by_ref adaptor is for:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
match iter.peek().map(|c| *c) {
None => return bits,
Some(c) => if c.is_digit(10) {
bits.push(iter.by_ref().take_while(|c| c.is_digit(10)).collect());
} else {
bits.push(iter.by_ref().take_while(|c| !c.is_digit(10)).collect());
},
}
}
}
fn main() {
println!("{:?}", split("123abc456def"))
}
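This prints
["123", "bc", "56", "ef"]
However, this is not the desired output: take_while also consumes the first element that fails its predicate, so the first character of every later clump is lost.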
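
The lost characters can be reproduced in isolation. The sketch below uses a hypothetical helper, digits_then_rest, purely for illustration: it collects leading digits with take_while through by_ref and then reports what is left in the iterator. For the input "123abc" the remainder is "bc", because take_while has to pull 'a' out of the iterator to discover that it fails the predicate and then discards it.

```rust
// Hypothetical helper for illustration: collects the leading ASCII digits of
// `s` with `take_while`, then returns them with whatever the iterator has left.
fn digits_then_rest(s: &str) -> (String, &str) {
    let mut it = s.chars();
    let digits: String = it.by_ref().take_while(|c| c.is_ascii_digit()).collect();
    // The first non-digit was consumed by `take_while`, so it is missing here.
    (digits, it.as_str())
}

fn main() {
    let (digits, rest) = digits_then_rest("123abc");
    println!("digits = {:?}, rest = {:?}", digits, rest);
}
```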
I would actually recommend writing this as a normal for loop, using the char_indices iterator:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(input[start..i].to_string());
is_digit = this_is_digit;
start = i;
}
}
bits.push(input[start..].to_string());
bits
}
This form also allows for doing this with much fewer allocations (that is, the Strings are not required), because each returned value is just a slice into the input, and we can use lifetimes to state this:
pub fn split<'a>(input: &'a str) -> Vec<&'a str> {
let mut bits = vec![];
if input.is_empty() {
return bits;
}
let mut is_digit = input.chars().next().unwrap().is_digit(10);
let mut start = 0;
for (i, c) in input.char_indices() {
let this_is_digit = c.is_digit(10);
if is_digit != this_is_digit {
bits.push(&input[start..i]);
is_digit = this_is_digit;
start = i;
}
}
bits.push(&input[start..]);
bits
}
All that changed was the type signature, removing the Vec<String> type hint and the .to_string calls.
One could even write an iterator like this, to avoid having to allocate the Vec. Something like fn split<'a>(input: &'a str) -> Splits<'a> { /* construct a Splits */ } where Splits is a struct that implements Iterator<Item = &'a str>.
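
A possible shape for such an iterator is sketched below. The answer gives only the Splits name and the signature, so the field name rest and the body of next are assumptions about how it might be written, not the answerer's actual code.

```rust
// Lazily yields alternating runs of digits and non-digits as slices of the input.
pub struct Splits<'a> {
    // The part of the input that has not been yielded yet.
    rest: &'a str,
}

pub fn split(input: &str) -> Splits<'_> {
    Splits { rest: input }
}

impl<'a> Iterator for Splits<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let rest = self.rest;
        // An empty remainder means the iteration is finished.
        let is_digit = rest.chars().next()?.is_digit(10);
        // The run ends at the first character of the other kind.
        let end = rest
            .char_indices()
            .find(|&(_, c)| c.is_digit(10) != is_digit)
            .map_or(rest.len(), |(i, _)| i);
        let (head, tail) = rest.split_at(end);
        self.rest = tail;
        Some(head)
    }
}

fn main() {
    println!("{:?}", split("test123test").collect::<Vec<_>>());
}
```

Callers that still want a Vec can collect the iterator, while callers that only need the first few clumps avoid allocating altogether.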

take_while takes self by value: it consumes the iterator. Before Rust 1.0 it also was unfortunately able to be implicitly copied, leading to the surprising behaviour that you are observing.
You cannot use take_while for what you are wanting for these reasons. You will need to manually unroll your take_while invocations.
Here is one of many possible ways of dealing with this:
pub fn split(input: &str) -> Vec<String> {
let mut bits: Vec<String> = vec![];
let mut iter = input.chars().peekable();
loop {
let seeking_digits = match iter.peek() {
None => return bits,
Some(c) => c.is_digit(10),
};
if seeking_digits {
bits.push(take_while(&mut iter, |c| c.is_digit(10)));
} else {
bits.push(take_while(&mut iter, |c| !c.is_digit(10)));
}
}
}
fn take_while<I, F>(iter: &mut std::iter::Peekable<I>, predicate: F) -> String
where
I: Iterator<Item = char>,
F: Fn(&char) -> bool,
{
let mut out = String::new();
loop {
match iter.peek() {
Some(c) if predicate(c) => out.push(*c),
_ => return out,
}
let _ = iter.next();
}
}
fn main() {
println!("{:?}", split("test123test"));
}
This yields a solution with two levels of looping; another valid approach would be to model it as a state machine one level deep only. Ask if you aren’t sure what I mean and I’ll demonstrate.
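
One way to read the "state machine one level deep" idea is a single loop that remembers which kind of clump it is currently in. The answer does not show this variant, so the following is only a guess at what was meant, and the name split_clumps is hypothetical.

```rust
// Hypothetical single-loop variant: the state is `in_digits` plus the clump
// being built; a clump is emitted whenever the kind of character changes.
fn split_clumps(input: &str) -> Vec<String> {
    let mut bits = Vec::new();
    let mut current = String::new();
    let mut in_digits = false;
    for c in input.chars() {
        let is_digit = c.is_digit(10);
        if !current.is_empty() && is_digit != in_digits {
            // State transition: hand over the finished clump and start a new one.
            bits.push(std::mem::take(&mut current));
        }
        in_digits = is_digit;
        current.push(c);
    }
    if !current.is_empty() {
        bits.push(current);
    }
    bits
}

fn main() {
    println!("{:?}", split_clumps("test123test"));
}
```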
