How to call regexes in a loop without cloning the data - rust

I am writing some code that applies N regexes to some contents and, if possible, I'd like to avoid cloning the string every time, since not every regex will actually match. Is that even possible?
Here is my attempt:
use std::borrow::Cow;
use regex::Regex;
fn main() {
    let test = "abcde";
    let regexes = vec![
        (Regex::new("a").unwrap(), "b"),
        (Regex::new("b").unwrap(), "c"),
        (Regex::new("z").unwrap(), "-"),
    ];
    let mut contents = Cow::Borrowed(test);
    for (regex, new_value) in regexes {
        contents = regex.replace_all(&contents, new_value);
    }
    println!("{}", contents);
}
The expected result would be cccde (if it compiled), with only two clones. But to make it work, I have to clone on every iteration:
fn main() {
    let test = "abcde";
    let regexes = vec![
        (Regex::new("a").unwrap(), "b"),
        (Regex::new("b").unwrap(), "c"),
        (Regex::new("z").unwrap(), "-"),
    ];
    let mut contents = test.to_string();
    for (regex, new_value) in regexes {
        contents = regex.replace_all(&contents, new_value).to_string();
    }
    println!("{}", contents);
}
This outputs cccde, but with 3 clones.
Is it possible to avoid this somehow? I know I could call each regex and rebind the result, but I have no control over how many regexes there will be.
Thanks in advance!
EDIT 1
For those who want to see the real code:
It is doing O(n^2) regex operations.
It starts here https://github.com/jaysonsantos/there-i-fixed-it/blob/ad214a27606bc595d80bb7c5968d4f80ac032e65/src/plan/executor.rs#L185-L192 and calls this https://github.com/jaysonsantos/there-i-fixed-it/blob/main/src/plan/mod.rs#L107-L115
EDIT 2
Here is the new code with the accepted answer https://github.com/jaysonsantos/there-i-fixed-it/commit/a4f5916b3e80749de269efa219b0689cb08551f2

You can do it by using a String as the persistent owner of the contents while they are being replaced and, on each iteration, checking whether the returned Cow is owned. If it is owned, you know a replacement took place, so you assign the String owned by the Cow to contents.
let mut contents = test.to_owned();
for (regex, new_value) in regexes {
    let new_contents = regex.replace_all(&contents, new_value);
    if let Cow::Owned(new_string) = new_contents {
        contents = new_string;
    }
}
Note that assignment in Rust is by default a 'move' - this means that the value of new_string is moved rather than copied into contents.
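To try the pattern without pulling in the regex crate, here is a minimal std-only sketch. The replace_all function below is a hypothetical stand-in for Regex::replace_all (it matches plain substrings, not regexes), and apply_rules with its allocation counter is an illustrative name that is not part of the original code.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for `Regex::replace_all`: borrows the input when
// nothing matches and allocates a new `String` only when something does.
fn replace_all<'a>(haystack: &'a str, needle: &str, replacement: &str) -> Cow<'a, str> {
    if haystack.contains(needle) {
        Cow::Owned(haystack.replace(needle, replacement))
    } else {
        Cow::Borrowed(haystack)
    }
}

// Applies every (needle, replacement) rule in order. Returns the final text
// and how many `String`s were allocated along the way.
fn apply_rules(input: &str, rules: &[(&str, &str)]) -> (String, usize) {
    let mut contents = input.to_owned();
    let mut allocations = 1; // the initial `to_owned`
    for &(needle, replacement) in rules {
        let new_contents = replace_all(&contents, needle, replacement);
        if let Cow::Owned(new_string) = new_contents {
            // Move the freshly built String in; no extra copy is made.
            contents = new_string;
            allocations += 1;
        }
    }
    (contents, allocations)
}

fn main() {
    let rules = [("a", "b"), ("b", "c"), ("z", "-")];
    let (result, allocations) = apply_rules("abcde", &rules);
    println!("{} ({} allocations)", result, allocations);
}
```

With the three rules from the question, only the two matching rules allocate; the z rule returns a borrowed Cow and is skipped.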

Related

Insert constructed string into Vec in Rust

Right now I am writing a program where I am updating a Vec with a string constructed based on conditions in a for loop. A (very contrived) simplified form of what I'm trying to do is the following:
fn main() {
    let mut arr = vec!["_"; 5];
    for (i, chr) in "abcde".char_indices() {
        arr[i] = &chr.to_string().repeat(3);
    }
}
However, I am getting the error temporary value dropped while borrowed. Any pointers on what to do here?
The lifetime of arr is the scope of the main function, while the String produced by chr.to_string().repeat(3) is only valid within a single iteration of the for loop. Storing a reference to it in arr causes the error.
You can avoid this problem by using a Vec<String> instead of Vec<&str>.
fn main() {
    let mut arr = vec!["_".to_string(); 5];
    for (i, chr) in "abcde".char_indices() {
        arr[i] = chr.to_string().repeat(3);
    }
}
Here the String "_".to_string() is cloned five times (which is not very efficient). I suspect this is not the case in your real code.
Using String, you can also try this one-liner:
let arr: Vec<String> = "abcde".chars().map(|c| c.to_string().repeat(3)).collect();
Output:
["aaa", "bbb", "ccc", "ddd", "eee"]
As others have mentioned, the Strings that you are creating have to be owned by someone, otherwise they end up being dropped. As the compiler detects that this drop occurs while your array still holds references to them, it will complain.
You need to think about who needs to own these values. If your eventual array is the natural place for them to live, just move them there:
fn main() {
    let mut arr = Vec::with_capacity(5);
    for (i, chr) in "abcde".char_indices() {
        arr.push(chr.to_string().repeat(3));
    }
}
If you absolutely need an array of &str, you still need to keep these values alive for at least as long as the references themselves:
fn i_only_consume_refs(data: Vec<&String>) -> () {}
fn main() {
    let mut arr = Vec::with_capacity(5);
    for (i, chr) in "abcde".char_indices() {
        arr.push(chr.to_string().repeat(3));
    }
    let refs = arr.iter().collect();
    i_only_consume_refs(refs)
}
Here, we are still moving all the created Strings into the vector arr, and then taking references to its elements. This way, the vector of references is valid for as long as arr (which owns the strings) is.
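If the consumer really wants &str rather than &String, the same idea applies: keep the owned Strings in one vector and borrow from it. A minimal sketch, where build_owned and as_str_slices are illustrative names that do not appear in the question:

```rust
// Builds the owned strings; the caller keeps this vector alive.
fn build_owned(input: &str) -> Vec<String> {
    input.chars().map(|c| c.to_string().repeat(3)).collect()
}

// Borrows each owned `String` as a `&str` without copying any text.
fn as_str_slices(owned: &[String]) -> Vec<&str> {
    owned.iter().map(String::as_str).collect()
}

fn main() {
    let owned = build_owned("abcde");
    // `refs` may not outlive `owned`; the compiler enforces this.
    let refs = as_str_slices(&owned);
    println!("{:?}", refs);
}
```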
TL;DR: Someone needs to own these Strings while you keep references to them. You cannot create temporary strings, and only store the reference, otherwise you will have a reference to a dropped value, which is very bad indeed, and the compiler will not let you do that.
The problem is that arr only holds references, and the strings inside must be owned elsewhere. A possible fix is to simply leak the transient String you created inside the loop. Keep in mind that leaked memory is never freed.
fn main() {
    let mut arr = vec!["_"; 5];
    for (i, chr) in "abcde".char_indices() {
        arr[i] = Box::leak(Box::new(chr.to_string().repeat(3)));
    }
}

How to translate "x-y" to vec![x, x+1, … y-1, y]?

This solution seems rather inelegant:
fn parse_range(&self, string_value: &str) -> Vec<u8> {
    let values: Vec<u8> = string_value
        .splitn(2, "-")
        .map(|part| part.parse().ok().unwrap())
        .collect();
    { values[0]..(values[1] + 1) }.collect()
}
Since splitn(2, "-") returns exactly two results for any valid string_value, it would be better to assign the tuple directly to two variables first and last rather than a seemingly arbitrary-length Vec. I can't seem to do this with a tuple.
There are two instances of collect(), and I wonder if it can be reduced to one (or even zero).
Trivial implementation
fn parse_range(string_value: &str) -> Vec<u8> {
    let pos = string_value.find(|c| c == '-').expect("No valid string");
    let (first, second) = string_value.split_at(pos);
    let first: u8 = first.parse().expect("Not a number");
    let second: u8 = second[1..].parse().expect("Not a number");
    { first..second + 1 }.collect()
}
I would recommend returning a Result<Vec<u8>, Error> instead of panicking with expect/unwrap.
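A minimal sketch of what that could look like. The RangeError type and its variants are assumptions made for illustration, and the sketch uses str::split_once and the inclusive range syntax, both of which are available on current stable Rust:

```rust
use std::num::ParseIntError;

// Illustrative error type; the name and variants are assumptions.
#[derive(Debug, PartialEq)]
enum RangeError {
    MissingDash,
    BadNumber(ParseIntError),
}

// Lets `?` convert a failed `parse` into a `RangeError`.
impl From<ParseIntError> for RangeError {
    fn from(err: ParseIntError) -> Self {
        RangeError::BadNumber(err)
    }
}

fn parse_range(string_value: &str) -> Result<Vec<u8>, RangeError> {
    // `split_once` yields the two halves as a tuple, or None without a dash.
    let (first, last) = string_value
        .split_once('-')
        .ok_or(RangeError::MissingDash)?;
    let first: u8 = first.parse()?;
    let last: u8 = last.parse()?;
    // The inclusive range avoids the overflow that `last + 1` hits at 255.
    Ok((first..=last).collect())
}

fn main() {
    println!("{:?}", parse_range("3-7"));
    println!("{:?}", parse_range("37"));
    println!("{:?}", parse_range("3-x"));
}
```

The caller can then decide whether a malformed range is fatal, instead of the function deciding by panicking.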
Nightly implementation
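My next thought was about the second collect. Here is a code example that needs no collect at all. It required nightly when it was written; both features have been stable since Rust 1.26, so the feature attribute is no longer needed on current compilers.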
#![feature(conservative_impl_trait, inclusive_range_syntax)]
fn parse_range(string_value: &str) -> impl Iterator<Item = u8> {
    let pos = string_value.find(|c| c == '-').expect("No valid string");
    let (first, second) = string_value.split_at(pos);
    let first: u8 = first.parse().expect("Not a number");
    let second: u8 = second[1..].parse().expect("Not a number");
    first..=second
}
fn main() {
    println!("{:?}", parse_range("3-7").collect::<Vec<u8>>());
}
Instead of calling collect the first time, just advance the iterator:
let mut values = string_value
    .splitn(2, "-")
    .map(|part| part.parse().unwrap());
let start = values.next().unwrap();
let end = values.next().unwrap();
Do not call .ok().unwrap() — that converts the Result with useful error information to an Option, which has no information. Just call unwrap directly on the Result.
As already mentioned, if you want to return a Vec, you'll want to call collect to create it. If you want to return an iterator, you can. It's not bad even in stable Rust:
fn parse_range(string_value: &str) -> std::ops::Range<u8> {
    let mut values = string_value
        .splitn(2, "-")
        .map(|part| part.parse().unwrap());
    let start = values.next().unwrap();
    let end = values.next().unwrap();
    start..end + 1
}
fn main() {
    assert!(parse_range("1-5").eq(1..6));
}
Sadly, inclusive ranges were not yet stable when this answer was written, so you had to keep using + 1 or switch to nightly. Since Rust 1.26, start..=end works on stable.
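On a current stable compiler (inclusive ranges were stabilized in Rust 1.26), the same function can return a RangeInclusive directly. This is a sketch that keeps the unwrap-based error handling of the answer above; only the return type and the last expression change:

```rust
use std::ops::RangeInclusive;

fn parse_range(string_value: &str) -> RangeInclusive<u8> {
    let mut values = string_value
        .splitn(2, '-')
        .map(|part| part.parse::<u8>().unwrap());
    let start = values.next().unwrap();
    let end = values.next().unwrap();
    // No `+ 1` needed, so an upper bound of 255 no longer overflows.
    start..=end
}

fn main() {
    assert!(parse_range("1-5").eq(1..=5));
    println!("{:?}", parse_range("250-255").collect::<Vec<u8>>());
}
```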
Since splitn(2, "-") returns exactly two results for any valid string_value, it would be better to assign the tuple directly to two variables first and last rather than a seemingly arbitrary-length Vec. I can't seem to do this with a tuple.
This is not possible with Rust's type system. You are asking for dependent types, a way for runtime values to interact with the type system. You'd want splitn to return a (&str, &str) for a value of 2 and a (&str, &str, &str) for a value of 3. That gets even more complicated when the argument is a variable, especially when it's set at run time.
The closest workaround would be to have a runtime check that there are no more values:
assert!(values.next().is_none());
Such a check doesn't feel valuable to me.
See also:
What is the correct way to return an Iterator (or any other trait)?
How do I include the end value in a range?

How do I reuse the SplitWhitespace iterator?

I've got a piece of code which is supposed to check if two sentences are "too similar", as defined by a heuristic made clearest by the code.
fn too_similar(thing1: &String, thing2: &String) -> bool {
    let split1 = thing1.split_whitespace();
    let split2 = thing2.split_whitespace();
    let mut matches = 0;
    for s1 in split1 {
        for s2 in split2 {
            if s1.eq(s2) {
                matches = matches + 1;
                break;
            }
        }
    }
    let longer_length =
        if thing1.len() > thing2.len() {
            thing1.len()
        } else {
            thing2.len()
        };
    matches > longer_length / 2
}
However, I'm getting the following compilation error:
error[E0382]: use of moved value: `split2`
 --> src/main.rs:7:19
  |
7 |         for s2 in split2 {
  |                   ^^^^^^ value moved here in previous iteration of loop
  |
  = note: move occurs because `split2` has type `std::str::SplitWhitespace<'_>`, which does not implement the `Copy` trait
I'm not sure why split2 is getting moved in the first place, but what's the Rust way of writing this function?
split2 is getting moved because iterating with for consumes the iterator, and since the type does not implement Copy, Rust does not copy it implicitly.
You can fix this by creating a new iterator inside the first for:
let split1 = thing1.split_whitespace();
let mut matches = 0;
for s1 in split1 {
    for s2 in thing2.split_whitespace() {
        if s1.eq(s2) {
            matches = matches + 1;
            break;
        }
    }
}
...
You can also rewrite the match-counting loop using some higher-order functions available in the Iterator trait:
let matches = thing1.split_whitespace()
    .flat_map(|c1| thing2.split_whitespace().filter(move |&c2| c1 == c2))
    .count();
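Note that the flat_map version counts every matching pair, whereas the original loop counts each word of thing1 at most once because of the break. If the loop's behaviour is what you want, a sketch along these lines keeps it. It relies on SplitWhitespace implementing Clone, so the inner iterator can be restarted cheaply; count_matches is an illustrative name:

```rust
// Counts the words of `thing1` that occur at least once in `thing2`.
fn count_matches(thing1: &str, thing2: &str) -> usize {
    let words2 = thing2.split_whitespace();
    thing1
        .split_whitespace()
        // Cloning the iterator copies only its position, not the text.
        .filter(|w1| words2.clone().any(|w2| w2 == *w1))
        .count()
}

fn main() {
    println!("{}", count_matches("the quick brown fox", "the lazy brown dog"));
}
```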
longer_length can also be written as:
let longer_length = std::cmp::max(thing1.len(), thing2.len());
There are possibly some better ways to do the word comparison.
If the phrases are long, then iterating over thing2's words for every word in thing1 is not very efficient. If you don't have to worry about words which appear more than once, then HashSet may help, and boils the iteration down to something like:
let words1: HashSet<&str> = thing1.split_whitespace().collect();
let words2: HashSet<&str> = thing2.split_whitespace().collect();
let matches = words1.intersection(&words2).count();
If you do care about repeated words you probably need a HashMap, and something like:
let mut words_hash1: HashMap<&str, usize> = HashMap::new();
for word in thing1.split_whitespace() {
    *words_hash1.entry(word).or_insert(0) += 1;
}
let matches2: usize = thing2.split_whitespace()
    .map(|s| words_hash1.get(s).cloned().unwrap_or(0))
    .sum();
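For reference, here are both variants wrapped into self-contained functions with the imports they need. The function names distinct_matches and repeated_matches are illustrative only:

```rust
use std::collections::{HashMap, HashSet};

// Number of distinct words that appear in both inputs.
fn distinct_matches(thing1: &str, thing2: &str) -> usize {
    let words1: HashSet<&str> = thing1.split_whitespace().collect();
    let words2: HashSet<&str> = thing2.split_whitespace().collect();
    words1.intersection(&words2).count()
}

// For every word of `thing2`, adds how often that word occurs in `thing1`.
fn repeated_matches(thing1: &str, thing2: &str) -> usize {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for word in thing1.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    thing2
        .split_whitespace()
        .map(|word| counts.get(word).copied().unwrap_or(0))
        .sum()
}

fn main() {
    println!("{}", distinct_matches("a a b", "a c a"));
    println!("{}", repeated_matches("a a b", "a c a"));
}
```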

How to allocate a string before you know how big it needs to be

I'm sure this is a beginner's mistake. My code is:
...
let mut latest_date : Option<Date<Local>> = None;
let mut latest_datetime : Option<DateTime<Local>> = None;
let mut latest_activity : Option<&str> = None;
for wrapped_line in reader.lines() {
    let line = wrapped_line.unwrap();
    println!("line: {}", line);
    if date_re.is_match(&line) {
        let captures = date_re.captures(&line).unwrap();
        let year = captures.at(1).unwrap().parse::<i32>().unwrap();
        let month = captures.at(2).unwrap().parse::<u32>().unwrap();
        let day = captures.at(3).unwrap().parse::<u32>().unwrap();
        latest_date = Some(Local.ymd(year, month, day));
        println!("date: {}", latest_date.unwrap());
    }
    if time_activity_re.is_match(&line) && latest_date != None {
        let captures = time_activity_re.captures(&line).unwrap();
        let hour = captures.at(1).unwrap().parse::<u32>().unwrap();
        let minute = captures.at(2).unwrap().parse::<u32>().unwrap();
        let activity = captures.at(3).unwrap();
        latest_datetime = Some(latest_date.unwrap().and_hms(hour, minute, 0));
        latest_activity = if activity.len() > 0 {
            Some(activity)
        } else {
            None
        };
        println!("time activity: {} |{}|", latest_datetime.unwrap(), activity);
    }
}
...
My error is:
Compiling tt v0.1.0 (file:///home/chris/cloud/tt)
src/main.rs:69:55: 69:59 error: `line` does not live long enough
src/main.rs:69 let captures = time_activity_re.captures(&line).unwrap();
^~~~
src/main.rs:55:5: 84:6 note: in this expansion of for loop expansion
src/main.rs:53:51: 86:2 note: reference must be valid for the block suffix following statement 7 at 53:50...
src/main.rs:53 let mut latest_activity : Option<&str> = None;
src/main.rs:54
src/main.rs:55 for wrapped_line in reader.lines() {
src/main.rs:56 let line = wrapped_line.unwrap();
src/main.rs:57 println!("line: {}", line);
src/main.rs:58
...
src/main.rs:56:42: 84:6 note: ...but borrowed value is only valid for the block suffix following statement 0 at 56:41
src/main.rs:56 let line = wrapped_line.unwrap();
src/main.rs:57 println!("line: {}", line);
src/main.rs:58
src/main.rs:59 if date_re.is_match(&line) {
src/main.rs:60 let captures = date_re.captures(&line).unwrap();
src/main.rs:61 let year = captures.at(1).unwrap().parse::<i32>().unwrap();
...
error: aborting due to previous error
Could not compile `tt`.
I think the problem is that latest_activity : Option<&str> lives longer than line, which only exists inside the loop iteration where latest_activity is reassigned.
Is this correct?
If so, what's the best way of fixing it? The cost of allocating a new string does not bother me, though I would prefer not to do that for each iteration.
I feel I may need a reference-counted box to put the activity in. Is this the right approach?
I could allocate a String outside of the loop, but how can I do so before I know how big it will need to be?
The problem is that you are already allocating a new string for every iteration (there's nowhere for the Lines iterator to store a buffer, so it has to allocate a fresh String for each line), but you're trying to store a slice into it outside the loop.
You also can't really know how big an externally allocated String would need to be in this case... so typically you wouldn't worry about it and just resize as necessary.
The simplest way is probably to make latest_activity an Option<String>. When you want to change it, you can use .clear() followed by .push_str(s) (see the String documentation). This should re-use the existing allocation if it's large enough, resizing if it isn't. It might require some re-allocating, but nothing major (provided you don't, for example, try to store increasingly long strings).
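A minimal std-only sketch of that approach. It does not use the regexes or chrono types from the question; instead it assumes a hypothetical line format of "HH:MM activity text", and latest_activity here is an illustrative function rather than the asker's variable:

```rust
// Returns the activity of the last line that carries an activity field.
// Assumed (hypothetical) line format: "HH:MM activity text".
fn latest_activity(input: &str) -> Option<String> {
    let mut latest: Option<String> = None;
    for line in input.lines() {
        // Stand-in for the owned String that `reader.lines()` yields;
        // it is dropped at the end of every iteration.
        let line = line.to_string();
        let activity = match line.split_once(' ') {
            Some((_time, rest)) => rest,
            None => continue, // no activity field on this line
        };
        if activity.is_empty() {
            // Mirrors the question: an empty activity resets the value.
            latest = None;
        } else {
            // Reuse the existing buffer if there is one, else start empty.
            let buf = latest.get_or_insert_with(String::new);
            buf.clear();
            buf.push_str(activity);
        }
    }
    latest
}

fn main() {
    println!("{:?}", latest_activity("09:00 coding\n10:30 lunch"));
}
```

The text is copied into the long-lived buffer, so nothing borrows from line after the iteration ends.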
Another possibility would be to just store wrapped_line itself, moving it out of the loop. You could store that alongside the slice indices, and then do the actual slicing outside the loop (no, you can't just store the String and the &str slice separately or together with just standard library types).
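A rough sketch of the second idea, again under the assumed line format "HH:MM activity text" rather than the question's regexes: keep the whole line together with the byte range of the activity, and slice only after the loop. The name latest_activity_span is illustrative.

```rust
use std::ops::Range;

// Returns the last line with a non-empty activity, plus the byte range
// of that activity within the line.
fn latest_activity_span(input: &str) -> Option<(String, Range<usize>)> {
    let mut latest = None;
    for line in input.lines() {
        // Stand-in for the owned String yielded by `reader.lines()`.
        let line = line.to_string();
        if let Some(space) = line.find(' ') {
            // Indices are plain numbers, so they do not borrow from `line`.
            let span = space + 1..line.len();
            latest = if span.is_empty() { None } else { Some((line, span)) };
        }
    }
    latest
}

fn main() {
    if let Some((line, span)) = latest_activity_span("09:00 coding\n10:30 lunch break") {
        // The slice is taken here, outside the loop.
        println!("latest activity: {}", &line[span]);
    }
}
```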

How do I get the first character out of a string?

I want to get the first character of a std::str. The method char_at() is currently unstable, as is String::slice_chars.
I have come up with the following, but it seems excessive to get a single character and not use the rest of the vector:
let text = "hello world!";
let char_vec: Vec<char> = text.chars().collect();
let ch = char_vec[0];
UTF-8 does not define what "character" is so it depends on what you want. In this case, chars are Unicode scalar values, and so the first char of a &str is going to be between one and four bytes.
If you want just the first char, then don't collect into a Vec<char>, just use the iterator:
let text = "hello world!";
let ch = text.chars().next().unwrap();
Alternatively, you can use the iterator's nth method:
let ch = text.chars().nth(0).unwrap();
Bear in mind that elements preceding the index passed to nth will be consumed from the iterator.
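If the string may be empty, unwrap will panic. A small sketch that returns an Option instead and also shows that a multi-byte first character comes back whole (first_char is an illustrative helper name):

```rust
// Returns the first Unicode scalar value, or None for an empty string.
fn first_char(text: &str) -> Option<char> {
    text.chars().next()
}

fn main() {
    for text in ["hello world!", "éclair", ""] {
        match first_char(text) {
            Some(ch) => println!("first char of {:?}: {}", text, ch),
            None => println!("{:?} is empty", text),
        }
    }
}
```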
I wrote a function that returns the head of a &str and the rest:
fn car_cdr(s: &str) -> (&str, &str) {
    for i in 1..5 {
        let r = s.get(0..i);
        match r {
            Some(x) => return (x, &s[i..]),
            None => (),
        }
    }
    (&s[0..0], s)
}
Use it like this:
let (first_char, remainder) = car_cdr("test");
println!("first char: {}\nremainder: {}", first_char, remainder);
The output looks like:
first char: t
remainder: est
It works fine with chars that are more than 1 byte.
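An alternative sketch of the same head-and-rest split that avoids probing byte offsets: take one char from the Chars iterator and ask the iterator for the unread remainder with as_str. The function name split_first_char is illustrative, and unlike car_cdr it returns the head as a char and signals an empty string with None:

```rust
// Splits a string into its first char and the remaining slice.
fn split_first_char(s: &str) -> Option<(char, &str)> {
    let mut chars = s.chars();
    let first = chars.next()?;
    // `as_str` returns the part of the string not yet consumed.
    Some((first, chars.as_str()))
}

fn main() {
    if let Some((first, rest)) = split_first_char("test") {
        println!("first char: {}\nremainder: {}", first, rest);
    }
}
```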
Get the first single character out of a string without using the rest of that string:
let text = "hello world!";
let ch = text.chars().take(1).last().unwrap();
It would be nice to have something similar to Haskell's head function and tail function for such cases.
I wrote this function to act like head and tail together (doesn't match exact implementation)
pub fn head_tail<T: Iterator, O: FromIterator<<T>::Item>>(iter: &mut T) -> (Option<<T>::Item>, O) {
    (iter.next(), iter.collect::<O>())
}
Usage:
// works with Vec<i32>
let mut val = vec![1, 2, 3].into_iter();
println!("{:?}", head_tail::<_, Vec<i32>>(&mut val));
// works with chars in two ways
let mut val = "thanks! bedroom builds YT".chars();
println!("{:?}", head_tail::<_, String>(&mut val));
// calling the function with Vec<char>
let mut val = "thanks! bedroom builds YT".chars();
println!("{:?}", head_tail::<_, Vec<char>>(&mut val));
NOTE: The head_tail function doesn't panic if the iterator is empty. Haskell's head and tail throw an exception on an empty list, so this does not match their behaviour exactly. It might also be good to accept the IntoIterator trait to be more compatible with other types.
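A sketch of that IntoIterator variant, assuming it is acceptable for the function to take its argument by value. It accepts anything iterable (a Vec, an array, an existing iterator) without the caller having to create a mutable iterator first:

```rust
use std::iter::FromIterator;

// Returns the first item (if any) and collects the rest into `O`.
pub fn head_tail<I, O>(iterable: I) -> (Option<I::Item>, O)
where
    I: IntoIterator,
    O: FromIterator<I::Item>,
{
    let mut iter = iterable.into_iter();
    let head = iter.next();
    (head, iter.collect())
}

fn main() {
    println!("{:?}", head_tail::<_, Vec<i32>>(vec![1, 2, 3]));
    println!("{:?}", head_tail::<_, String>("hello".chars()));
}
```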
If you only want to test for it, you can use starts_with():
"rust".starts_with('r')
"rust".starts_with(|c| c == 'r')
I think it is pretty straightforward:
let text = "hello world!";
let c: char = text.chars().next().unwrap();
next() takes the next item from the iterator
To “unwrap” something in Rust is to say, “Give me the result of the computation, and if there was an error, panic and stop the program.”
The accepted answer is a bit ugly!
let text = "hello world!";
let ch = &text[0..1]; // this returns "h" as a &str; it panics if the string is empty or its first character is longer than one byte
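Byte-range slicing like this only works when the first character is exactly one byte long; otherwise the index expression panics. A hedged sketch of a non-panicking variant using str::get, which returns None when the range is out of bounds or does not fall on a character boundary (first_byte_slice is an illustrative name):

```rust
// Returns the first byte as a &str, but only if that byte is a whole character.
fn first_byte_slice(text: &str) -> Option<&str> {
    text.get(0..1)
}

fn main() {
    println!("{:?}", first_byte_slice("hello world!"));
    println!("{:?}", first_byte_slice("éclair"));
    println!("{:?}", first_byte_slice(""));
}
```

For text that is not guaranteed to be ASCII, text.chars().next() remains the safer choice.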
