Split a string keeping the separators

Is there a trivial way to split a string keeping the separators?
Instead of this:
let texte = "Ten. Million. Questions. Let's celebrate all we've done together.";
let v: Vec<&str> = texte.split(|c: char| !(c.is_alphanumeric() || c == '\'')).filter(|s| !s.is_empty()).collect();
which results in ["Ten", "Million", "Questions", "Let's", "celebrate", "all", "we've", "done", "together"],
I would like something that gives me:
["Ten", ".", " ", "Million", ".", " ", "Questions", ".", " ", "Let's", " ", "celebrate", " ", "all", " ", "we've", " ", "done", " ", "together", "."].
I tried the following code (it assumes the string begins with a letter and ends with a non-letter):
let texte = "Ten. Million. Questions. Let's celebrate all we've done together. ";
let v1: Vec<&str> = texte.split(|c: char| !(c.is_alphanumeric() || c == '\'')).filter(|s| !s.is_empty()).collect();
let v2: Vec<&str> = texte.split(|c: char| c.is_alphanumeric() || c == '\'').filter(|s| !s.is_empty()).collect();
let mut w: Vec<&str> = Vec::new();
let mut j = 0;
for i in v2 {
w.push(v1[j]);
w.push(i);
j = j+1;
}
It doesn't give exactly the result I wrote earlier (consecutive separators stay grouped), but it is good enough:
["Ten", ". ", "Million", ". ", "Questions", ". ", "Let's", " ", "celebrate", " ", "all", " ", "we've", " ", "done", " ", "together", "."]
However, is there a better way to write this? I tried to use enumerate on v2 but it didn't work, and using j in the for loop looks rough.

Using str::match_indices:
let text = "Ten. Million. Questions. Let's celebrate all we've done together.";
let mut result = Vec::new();
let mut last = 0;
for (index, matched) in text.match_indices(|c: char| !(c.is_alphanumeric() || c == '\'')) {
if last != index {
result.push(&text[last..index]);
}
result.push(matched);
last = index + matched.len();
}
if last < text.len() {
result.push(&text[last..]);
}
println!("{:?}", result);
Prints:
["Ten", ".", " ", "Million", ".", " ", "Questions", ".", " ", "Let's", " ", "celebrate", " ", "all", " ", "we've", " ", "done", " ", "together", "."]

str::split_inclusive, available since Rust 1.51, returns an iterator keeping the delimiters as part of the matched strings, and may be useful in certain cases:
#[test]
fn split_with_delimiter() {
let items: Vec<_> = "alpha,beta;gamma"
.split_inclusive(&[',', ';'][..])
.collect();
assert_eq!(&items, &["alpha,", "beta;", "gamma"]);
}
#[test]
fn split_with_delimiter_allows_consecutive_delimiters() {
let items: Vec<_> = ",;".split_inclusive(&[',', ';'][..]).collect();
assert_eq!(&items, &[",", ";"]);
}
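If separate delimiter items are needed, as in the question, the chunks from split_inclusive can be post-processed. The following is only a sketch under that assumption: the helper name split_keep_delims is hypothetical and not part of the standard library, and the delimiter test is passed in as a closure.

```rust
/// Hypothetical helper: splits `text` into words and single-character
/// delimiters, keeping both, by post-processing `split_inclusive` chunks.
fn split_keep_delims<'a, F>(text: &'a str, is_delim: F) -> Vec<&'a str>
where
    F: Fn(char) -> bool,
{
    let mut out = Vec::new();
    // Each chunk ends with at most one delimiter character.
    for chunk in text.split_inclusive(|c: char| is_delim(c)) {
        match chunk.char_indices().next_back() {
            // The chunk ends with a delimiter: emit the word (if any), then the delimiter.
            Some((i, c)) if is_delim(c) => {
                if i > 0 {
                    out.push(&chunk[..i]);
                }
                out.push(&chunk[i..]);
            }
            // Last chunk without a trailing delimiter.
            _ => out.push(chunk),
        }
    }
    out
}
```

Called with the predicate from the question, `|c| !(c.is_alphanumeric() || c == '\'')`, this yields every word, punctuation mark and space as its own item.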

I was not able to find anything in the standard library, so I wrote my own.
This version uses the unstable pattern API (nightly only, so its exact signatures may have changed since this was written) as it's more flexible, but the link above has a fallback that I've hardcoded for my specific stable use case.
#![feature(pattern)]
use std::str::pattern::{Pattern, Searcher};
#[derive(Copy, Clone, Debug, PartialEq)]
pub enum SplitType<'a> {
Match(&'a str),
Delimiter(&'a str),
}
pub struct SplitKeepingDelimiter<'p, P>
where
P: Pattern<'p>,
{
searcher: P::Searcher,
start: usize,
saved: Option<usize>,
}
impl<'p, P> Iterator for SplitKeepingDelimiter<'p, P>
where
P: Pattern<'p>,
{
type Item = SplitType<'p>;
fn next(&mut self) -> Option<Self::Item> {
if self.start == self.searcher.haystack().len() {
return None;
}
if let Some(end_of_match) = self.saved.take() {
let s = &self.searcher.haystack()[self.start..end_of_match];
self.start = end_of_match;
return Some(SplitType::Delimiter(s));
}
match self.searcher.next_match() {
Some((start, end)) => {
if self.start == start {
let s = &self.searcher.haystack()[start..end];
self.start = end;
Some(SplitType::Delimiter(s))
} else {
let s = &self.searcher.haystack()[self.start..start];
self.start = start;
self.saved = Some(end);
Some(SplitType::Match(s))
}
}
None => {
let s = &self.searcher.haystack()[self.start..];
self.start = self.searcher.haystack().len();
Some(SplitType::Match(s))
}
}
}
}
pub trait SplitKeepingDelimiterExt: ::std::ops::Index<::std::ops::RangeFull, Output = str> {
fn split_keeping_delimiter<P>(&self, pattern: P) -> SplitKeepingDelimiter<P>
where
P: for<'a> Pattern<'a>,
{
SplitKeepingDelimiter {
searcher: pattern.into_searcher(&self[..]),
start: 0,
saved: None,
}
}
}
impl SplitKeepingDelimiterExt for str {}
#[cfg(test)]
mod test {
use super::SplitKeepingDelimiterExt;
#[test]
fn split_with_delimiter() {
use super::SplitType::*;
let delims = &[',', ';'][..];
let items: Vec<_> = "alpha,beta;gamma".split_keeping_delimiter(delims).collect();
assert_eq!(
&items,
&[
Match("alpha"),
Delimiter(","),
Match("beta"),
Delimiter(";"),
Match("gamma")
]
);
}
#[test]
fn split_with_delimiter_allows_consecutive_delimiters() {
use super::SplitType::*;
let delims = &[',', ';'][..];
let items: Vec<_> = ",;".split_keeping_delimiter(delims).collect();
assert_eq!(&items, &[Delimiter(","), Delimiter(";")]);
}
}
You'll note that I needed to track if something was one of the delimiters or not, but that should be easy to adapt if you don't need it.
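For readers on stable Rust, roughly the same tagged output can be built on str::match_indices. This is a minimal sketch, not the author's fallback: the names Piece and split_tagged are made up for illustration, the delimiters are limited to a slice of chars, and the result is collected eagerly instead of being a lazy iterator.

```rust
/// Hypothetical stand-in for the `SplitType` enum above.
#[derive(Copy, Clone, Debug, PartialEq)]
enum Piece<'a> {
    Match(&'a str),
    Delimiter(&'a str),
}

impl<'a> Piece<'a> {
    /// Drops the tag when only the text is needed.
    fn as_str(self) -> &'a str {
        match self {
            Piece::Match(s) | Piece::Delimiter(s) => s,
        }
    }
}

/// Splits `text` on any char in `delims`, keeping the delimiters as tagged items.
fn split_tagged<'a>(text: &'a str, delims: &[char]) -> Vec<Piece<'a>> {
    let mut out = Vec::new();
    let mut last = 0;
    for (index, delim) in text.match_indices(delims) {
        // Text between the previous delimiter and this one, if any.
        if last != index {
            out.push(Piece::Match(&text[last..index]));
        }
        out.push(Piece::Delimiter(delim));
        last = index + delim.len();
    }
    // Trailing text after the final delimiter.
    if last < text.len() {
        out.push(Piece::Match(&text[last..]));
    }
    out
}
```

Mapping the result through as_str gives the untagged list of string slices.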

Related

Take N elements with saving that satisfies predicate

I have this vec of strings:
vec![
"import a\n",
"\n",
"\n",
"b = 1 + 2\n",
"\n",
"print(b)\n",
"print(b + 1)\n",
"\n"
];
I want to take the first 3 lines that are not "\n", while also keeping all the "\n" lines between them, so that the result would be this:
vec![
"import a\n",
"\n",
"\n",
"b = 1 + 2\n",
"\n",
"print(b)\n"
];
Ideally, it could be done like this:
lines.take_n_saving(3, |line| line == "\n")

Vec::retain() can do this in-place, but you need an external counter (captured by the closure).
fn main() {
let mut lines = vec![
"import a\n",
"\n",
"\n",
"b = 1 + 2\n",
"\n",
"print(b)\n",
"print(b + 1)\n",
"\n",
];
let mut keep = 3;
lines.retain(|l| {
let result = keep > 0;
if *l != "\n" {
keep -= 1;
}
result
});
println!("{:?}", lines);
}
/*
["import a\n", "\n", "\n", "b = 1 + 2\n", "\n", "print(b)\n"]
*/

You can use Iterator::filter with a counter captured by the closure. Below is an example:
fn take_n_saving<'a, F: Fn(&str) -> bool>(n: i32, input: Vec<&'a str>, f: F)
-> Vec<&'a str>
{
let mut count = 0;
return input
.into_iter()
.filter(|x| match count >= n {
true => false,
false => {
if !f(x) {
count += 1;
}
true
}
})
.collect();
}
fn main() {
let input = vec![
"import a\n",
"\n",
"\n",
"b = 1 + 2\n",
"\n",
"print(b)\n",
"print(b + 1)\n",
"\n",
];
let output: Vec<_> = take_n_saving(3, input, |x: &str| x == "\n");
println!("{:?}", output);
}

I just searched through the Rust source code and found that .take() uses a Take struct under the hood. I think Take could be expressed as a special case of something like TakePred:
impl<I, P> Iterator for TakePred<I, P>
where
    I: Iterator,
    P: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;
    fn next(&mut self) -> Option<I::Item> {
        if self.n != 0 {
            let elem = self.iter.next()?;
            // Only items that do not satisfy the predicate count toward `n`
            if !(self.pred)(&elem) {
                self.n -= 1;
            }
            Some(elem)
        } else {
            None
        }
    }
}
So the solution to my problem would be this:
lines.take_pred(3, |line| line == "\n")
And the Take implementation would be this (pseudocode), with a predicate that never matches so that every item is counted:
Take(n) = TakePred(n, pred: |_| false)
But it is just an idea, not an exact solution.
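The idea can be turned into a small adapter on stable Rust. The sketch below is one possible reading of it, not something the standard library provides: the struct TakePred, the extension trait TakePredExt and the method take_pred are all hypothetical names. Items for which the predicate returns true pass through without being counted.

```rust
/// Hypothetical adapter: yields items until `n` items that do NOT satisfy
/// `pred` have been produced; items that satisfy `pred` are passed along for free.
struct TakePred<I, P> {
    iter: I,
    n: usize,
    pred: P,
}

impl<I, P> Iterator for TakePred<I, P>
where
    I: Iterator,
    P: FnMut(&I::Item) -> bool,
{
    type Item = I::Item;

    fn next(&mut self) -> Option<I::Item> {
        if self.n == 0 {
            return None;
        }
        // Stop when the underlying iterator is exhausted.
        let elem = self.iter.next()?;
        if !(self.pred)(&elem) {
            self.n -= 1;
        }
        Some(elem)
    }
}

/// Hypothetical extension trait that adds `take_pred` to every iterator.
trait TakePredExt: Iterator + Sized {
    fn take_pred<P>(self, n: usize, pred: P) -> TakePred<Self, P>
    where
        P: FnMut(&Self::Item) -> bool,
    {
        TakePred { iter: self, n, pred }
    }
}

impl<I: Iterator> TakePredExt for I {}
```

With this in scope, `lines.into_iter().take_pred(3, |line| *line == "\n")` yields the six lines from the question, and a predicate that always returns false behaves like a plain take(n).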

How to make this iterator/for loop idiomatic in Rust

How can I get rid of the ugly let mut i = 0; and i += 1;? Is there a more idiomatic way to write this loop? I tried .enumerate() but it doesn't work on a &[&str].
use std::fmt::Write;
pub fn build_proverb(list: &[&str]) -> String {
let mut proverb = String::new();
let mut i = 0;
for word in list {
i += 1;
if i < list.len() {
write!(proverb, "For want of a {} the {} was lost.\n", word, list[i]);
} else {
write!(proverb, "And all for the want of a {}.", list[0]);
}
}
proverb
}

You were close. enumerate() doesn't exist for &[&str] because enumerate() is a method on Iterator, and &[&str] is not an Iterator.
But you can call list.iter() to get an iterator, which you can then call enumerate() on. Note that enumerate() counts from 0, so the following word is list[i + 1], not list[i].
Full example:
use std::fmt::Write;
pub fn build_proverb(list: &[&str]) -> String {
    let mut proverb = String::new();
    for (i, word) in list.iter().enumerate() {
        // `i` is the zero-based index of `word`, so the next word is at `i + 1`
        if i + 1 < list.len() {
            write!(proverb, "For want of a {} the {} was lost.\n", word, list[i + 1]).unwrap();
        } else {
            write!(proverb, "And all for the want of a {}.", list[0]).unwrap();
        }
    }
    proverb
}
The fact you have word and list[i] (after incrementing i) makes it look like you want something like windows:
pub fn build_proverb(list: &[&str]) -> String {
let mut proverb = String::new();
for words in list.windows(2) {
let first = words[0];
let second = words[1];
write!(proverb, "For want of a {} the {} was lost.\n", first, second);
}
write!(proverb, "And all for the want of a {}.", list[0]);
proverb
}
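Note that indexing list[0] panics when the list is empty, whereas the loop in the question simply returns an empty string. As a hedged alternative, the pairs can also be produced by zipping the list with itself shifted by one; this sketch keeps the function name from the question and builds the lines with format! instead of write!.

```rust
pub fn build_proverb(list: &[&str]) -> String {
    // Pair every word with its successor: (w0, w1), (w1, w2), ...
    let mut lines: Vec<String> = list
        .iter()
        .zip(list.iter().skip(1))
        .map(|(want, lost)| format!("For want of a {} the {} was lost.", want, lost))
        .collect();
    // The closing line only exists when there is at least one word.
    if let Some(first) = list.first() {
        lines.push(format!("And all for the want of a {}.", first));
    }
    lines.join("\n")
}
```

For an empty slice this returns an empty string, and for a single word only the closing line.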

How to shuffle a vector except for the first and last elements without using third party libraries?

I have a task to shuffle words but the first and last letter of every word must be unchanged. When I try to use filter() it doesn't work properly.
const SEPARATORS: &str = " ,;:!?./%*$=+)#_-('\"&1234567890\r\n";
fn main() {
print!("MAIN:{:?}", mix("Evening,morning"));
}
fn mix(s: &str) -> String {
let mut a: Vec<char> = s.chars().collect();
for group in a.split_mut(|num| SEPARATORS.contains(*num)) {
if group.len() > 4 {
let k = group.first().unwrap().clone();
let c = group[group.len() - 1].clone();
group
.chunks_exact_mut(2)
.filter(|x| x != &[k])
.for_each(|x| x.swap(0, 1))
}
}
let s: String = a.iter().collect();
s
}

Is this what you are looking for?
fn mix(s: &str) -> String {
let mut a: Vec<char> = s.chars().collect();
for words in a.split_mut(|num| SEPARATORS.contains(*num)) {
if words.len() > 4 {
let initial_letter = words.first().unwrap().clone();
let last_letter = words[words.len() - 1].clone();
words[0] = last_letter;
words[words.len() - 1] = initial_letter;
}
}
let s: String = a.iter().collect();
s
}
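The code above swaps the first and last letters. If the goal is instead to shuffle the interior letters while the first and last stay fixed, a Fisher-Yates shuffle over the inner slice is one option. The sketch below makes several assumptions: it uses a tiny xorshift64 generator (the XorShift64 type is made up here, and is neither cryptographically secure nor free of modulo bias) because no third-party crate is allowed, it takes the seed as an explicit parameter, and it shuffles every word longer than 3 letters.

```rust
const SEPARATORS: &str = " ,;:!?./%*$=+)#_-('\"&1234567890\r\n";

/// Minimal xorshift64 pseudo-random generator (illustrative only).
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Shuffles the interior letters of every word, keeping the first and last
/// letters and all separators in place. `seed` is an assumed extra parameter.
fn mix(s: &str, seed: u64) -> String {
    // The xorshift state must never be zero.
    let mut rng = XorShift64(seed.max(1));
    let mut chars: Vec<char> = s.chars().collect();
    for word in chars.split_mut(|c| SEPARATORS.contains(*c)) {
        if word.len() > 3 {
            let last = word.len() - 1;
            let inner = &mut word[1..last];
            // Fisher-Yates: swap each position with a random earlier (or same) one.
            for i in (1..inner.len()).rev() {
                let j = (rng.next_u64() % (i as u64 + 1)) as usize;
                inner.swap(i, j);
            }
        }
    }
    chars.into_iter().collect()
}
```

A seed could be derived from the system clock, for example via std::time::SystemTime, if different output is wanted on every run.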

How to trim space less than n times?

How can I remove up to n spaces at the beginning of each line?
For example, when trimming up to 4 spaces:
"     5" -> " 5"
"    4" -> "4"
"   3" -> "3"
const INPUT: &str = "    4\n  2\n0\n\n      6\n";
const OUTPUT: &str = "4\n2\n0\n\n  6\n";
#[test]
fn main(){
assert_eq!(&trim_deindent(INPUT,4), OUTPUT)
}

I was about to suggest textwrap::dedent, but then I noticed the "2" line, which has fewer than 4 spaces. So you want to keep removing spaces, if there are any, up to a maximum of 4.
A quick solution could look something like this.
Your assert will pass, but note that lines ending in \r\n will be converted to \n, as lines does not provide a way to differentiate between \n and \r\n.
fn trim_deindent(text: &str, max: usize) -> String {
let mut new_text = text
.lines()
.map(|line| {
let mut max = max;
line.chars()
// Skip while `c` is a whitespace and at most `max` spaces
.skip_while(|c| {
if max == 0 {
false
} else {
max -= 1;
c.is_whitespace()
}
})
.collect::<String>()
})
.collect::<Vec<_>>()
.join("\n");
// Did the original `text` end with a `\n` then add it again
if text.ends_with('\n') {
new_text.push('\n');
}
new_text
}
If you want to retain both \n and \r\n, then you can take the more complex route of scanning through the string yourself, and thus avoid using lines.
fn trim_deindent(text: &str, max: usize) -> String {
    let mut new_text = String::new();
    let mut line_start = 0;
    loop {
        let mut max = max;
        // Skip up to `max` spaces; `char_indices` yields byte offsets that are safe for slicing
        let after_space = text[line_start..].char_indices().find(|&(_, c)| {
            // We can't use `is_whitespace` here, as that will skip past `\n` and `\r` as well
            if (max == 0) || !is_horizontal_whitespace(c) {
                true
            } else {
                max -= 1;
                false
            }
        });
        if let Some((offset, _)) = after_space {
            let after_space = line_start + offset;
            let line = &text[after_space..];
            // Find the end of the line including `\n`, or use the rest of the text (if it's the last line)
            let end = line.find('\n').map_or(line.len(), |i| i + 1);
            // Push the line (including the line ending) onto `new_text`
            new_text.push_str(&line[..end]);
            line_start = after_space + end;
        } else {
            break;
        }
    }
    new_text
}
#[inline]
fn is_horizontal_whitespace(c: char) -> bool {
(c != '\r') && (c != '\n') && c.is_whitespace()
}
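Since Rust 1.51, str::split_inclusive offers a shorter way to keep the original line endings, because each piece it yields still ends with its "\n" (and any preceding "\r"). The following is a sketch under one assumption that differs from the code above: only the ASCII space character is trimmed, not tabs or other whitespace.

```rust
fn trim_deindent(text: &str, max: usize) -> String {
    text.split_inclusive('\n')
        .map(|line| {
            // Count leading ASCII spaces, but never more than `max`.
            let spaces = line
                .chars()
                .take(max)
                .take_while(|&c| c == ' ')
                .count();
            // A space is one byte, so the count is also a valid byte offset.
            &line[spaces..]
        })
        .collect()
}
```

Because the line endings are never removed, there is no need to re-add a trailing "\n" afterwards.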

Is there a method like JavaScript's substr in Rust?

I looked at the Rust docs for String but I can't find a way to extract a substring.
Is there a method like JavaScript's substr in Rust? If not, how would you implement it?
str.substr(start[, length])
The closest is probably slice_unchecked but it uses byte offsets instead of character indexes and is marked unsafe.

For characters, you can use s.chars().skip(pos).take(len):
fn main() {
let s = "Hello, world!";
let ss: String = s.chars().skip(7).take(5).collect();
println!("{}", ss);
}
Beware of the definition of Unicode characters though.
For bytes, you can use the slice syntax:
fn main() {
let s = b"Hello, world!";
let ss = &s[7..12];
println!("{:?}", ss);
}

You can use the as_str method on the Chars iterator to get back a &str slice after you have advanced the iterator. So to skip the first start chars, you can call
let s = "Some text to slice into";
let mut iter = s.chars();
if start > 0 {
    iter.nth(start - 1); // nth(n) consumes n + 1 items, so this eats up `start` chars
}
let slice = iter.as_str(); // get back a slice of the rest of the iterator
Now if you also want to limit the length, you first need to figure out the byte position of the character at index length:
let end_pos = slice.char_indices().nth(length).map(|(n, _)| n).unwrap_or(0); // note: falls back to 0 (an empty result) when fewer than length + 1 chars remain
let substr = &slice[..end_pos];
This might feel a little roundabout, but Rust is not hiding anything from you that might take up CPU cycles. That said, I wonder why there's no crate yet that offers a substr method.

This code performs both substring extraction and string slicing by character position, without panicking or allocating:
use std::ops::{Bound, RangeBounds};
trait StringUtils {
fn substring(&self, start: usize, len: usize) -> &str;
fn slice(&self, range: impl RangeBounds<usize>) -> &str;
}
impl StringUtils for str {
fn substring(&self, start: usize, len: usize) -> &str {
let mut char_pos = 0;
let mut byte_start = 0;
let mut it = self.chars();
loop {
if char_pos == start { break; }
if let Some(c) = it.next() {
char_pos += 1;
byte_start += c.len_utf8();
}
else { break; }
}
char_pos = 0;
let mut byte_end = byte_start;
loop {
if char_pos == len { break; }
if let Some(c) = it.next() {
char_pos += 1;
byte_end += c.len_utf8();
}
else { break; }
}
&self[byte_start..byte_end]
}
fn slice(&self, range: impl RangeBounds<usize>) -> &str {
let start = match range.start_bound() {
Bound::Included(bound) | Bound::Excluded(bound) => *bound,
Bound::Unbounded => 0,
};
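let len = match range.end_bound() {
Bound::Included(bound) => bound.saturating_add(1),
Bound::Excluded(bound) => *bound,
Bound::Unbounded => self.len(),
}
// Saturate so that a start past the end yields an empty slice instead of an overflow panic
.saturating_sub(start);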
self.substring(start, len)
}
}
fn main() {
let s = "abcdèfghij";
// All three statements should print:
// "abcdè, abcdèfghij, dèfgh, dèfghij."
println!("{}, {}, {}, {}.",
s.substring(0, 5),
s.substring(0, 50),
s.substring(3, 5),
s.substring(3, 50));
println!("{}, {}, {}, {}.",
s.slice(..5),
s.slice(..50),
s.slice(3..8),
s.slice(3..));
println!("{}, {}, {}, {}.",
s.slice(..=4),
s.slice(..=49),
s.slice(3..=7),
s.slice(3..));
}

For my_string.substring(start, len)-like syntax, you can write a custom trait:
trait StringUtils {
fn substring(&self, start: usize, len: usize) -> Self;
}
impl StringUtils for String {
fn substring(&self, start: usize, len: usize) -> Self {
self.chars().skip(start).take(len).collect()
}
}
// Usage:
fn main() {
let phrase: String = "this is a string".to_string();
println!("{}", phrase.substring(5, 8)); // prints "is a str"
}

The solution given by oli_obk does not handle the last index of the string slice. It can be fixed with .chain(once(s.len())).
Here the function substr implements a substring slice with error handling. If an invalid index is passed to the function, the valid part of the string slice is returned in the Err variant. All corner cases should be handled correctly.
fn substr(s: &str, begin: usize, length: Option<usize>) -> Result<&str, &str> {
use std::iter::once;
let mut itr = s.char_indices().map(|(n, _)| n).chain(once(s.len()));
let beg = itr.nth(begin);
if beg.is_none() {
return Err("");
} else if length == Some(0) {
return Ok("");
}
let end = length.map_or(Some(s.len()), |l| itr.nth(l-1));
if let Some(end) = end {
return Ok(&s[beg.unwrap()..end]);
} else {
return Err(&s[beg.unwrap()..s.len()]);
}
}
let s = "abc🙂";
assert_eq!(Ok("bc"), substr(s, 1, Some(2)));
assert_eq!(Ok("c🙂"), substr(s, 2, Some(2)));
assert_eq!(Ok("c🙂"), substr(s, 2, None));
assert_eq!(Err("c🙂"), substr(s, 2, Some(99)));
assert_eq!(Ok(""), substr(s, 2, Some(0)));
assert_eq!(Err(""), substr(s, 5, Some(4)));
Note that this does not handle Unicode grapheme clusters. For example, "y̆es" contains 4 Unicode chars but 3 grapheme clusters. The unicode-segmentation crate solves this problem. Grapheme clusters are handled correctly if the part
let mut itr = s.char_indices()...
is replaced with
use unicode_segmentation::UnicodeSegmentation;
let mut itr = s.grapheme_indices(true)...
Then the following also works:
assert_eq!(Ok("y̆"), substr("y̆es", 0, Some(1)));

Knowing the various forms of the slice syntax might be helpful to some readers. Note that all indices are byte offsets, not character positions.
A reference to a part of a string:
&s[6..11]
If you start at index 0, you can omit the start value:
&s[0..1] is equivalent to &s[..1]
Likewise, if your substring runs to the last byte of the string, you can omit the end value:
&s[3..s.len()] is equivalent to &s[3..]
This also applies when the slice encompasses the entire string:
&s[..]
You can also use the inclusive range operator to include the last value:
&s[..=1]
Link to docs: https://doc.rust-lang.org/book/ch04-03-slices.html
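The equivalences above can be checked directly. The short sketch below uses a made-up sample string with one multi-byte character to show that the indices are byte offsets: slicing in the middle of a character panics with the [] syntax, while str::get returns None instead. The function name slice_examples is only for illustration.

```rust
fn slice_examples() {
    // 'ö' takes two bytes (offsets 8 and 9), so the string is 14 bytes long.
    let s = "Hello, wörld!";

    // Omitting the start or the end of the range.
    assert_eq!(&s[0..5], &s[..5]);
    assert_eq!(&s[7..s.len()], &s[7..]);
    assert_eq!(&s[..], s);

    // Inclusive range: bytes 0 through 4.
    assert_eq!(&s[..=4], "Hello");

    // Byte 9 is in the middle of 'ö': `get` reports this with `None`,
    // whereas `&s[7..9]` would panic at runtime.
    assert!(s.get(7..9).is_none());
    assert_eq!(s.get(7..10), Some("wö"));
}
```

For text that may contain non-ASCII characters, computing the byte offsets with char_indices first avoids such panics.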

I would suggest you use the crate substring. (And look at its source code if you want to learn how to do this properly.)

I couldn't find the exact substr implementation that I'm familiar with from other programming languages such as JavaScript and Dart.
Here is a possible implementation of a substr method for &str (for a String, call it on .as_str()).
Let's define a trait so that we can add methods to built-in types (like extensions in Dart).
trait Substr {
fn substr(&self, start: usize, end: usize) -> String;
}
Then implement this trait for &str:
impl<'a> Substr for &'a str {
fn substr(&self, start: usize, end: usize) -> String {
if start > end || start == end {
return String::new();
}
self.chars().skip(start).take(end - start).collect()
}
}
Try:
fn main() {
let string = "Hello, world!";
let substring = string.substr(0, 4);
println!("{}", substring); // Hell
}

You can also use .to_string()[ <range> ].
This example takes an immutable slice of the original string, then mutates that string to demonstrate the original slice is preserved.
let mut s: String = "Hello, world!".to_string();
let substring: &str = &s.to_string()[..6];
s.replace_range(..6, "Goodbye,");
println!("{} {} universe!", s, substring);
// Goodbye, world! Hello, universe!

I'm not very experienced in Rust, but I gave it a try. If someone can correct my answer, please don't hesitate.
fn substring(string:String, start:u32, end:u32) -> String {
let mut substr = String::new();
let mut i = start;
while i < end + 1 {
substr.push_str(&*(string.chars().nth(i as usize).unwrap().to_string()));
i += 1;
}
return substr;
}
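One possible correction, offered as a sketch: the loop above calls chars().nth(i) for every index, which rescans the string each time, and unwrap() panics when end is past the last character. The version below keeps the same inclusive end semantics but borrows the input and walks the characters only once.

```rust
/// Returns the characters from index `start` through `end` (inclusive).
/// Out-of-range indices are clamped instead of causing a panic.
fn substring(string: &str, start: usize, end: usize) -> String {
    if end < start {
        return String::new();
    }
    // Number of characters to keep; `saturating_add` guards against overflow.
    let count = (end - start).saturating_add(1);
    string.chars().skip(start).take(count).collect()
}
```

For example, substring("Hello, world!", 7, 11) gives "world".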
