Iterating through a Window of a String without collect

I need to iterate through and compare a window of unknown length of a string. My current implementation works, however I've done performance tests against it, and it is very inefficient. The method needs to be guaranteed to be safe against Unicode.
fn foo(line: &str, patt: &str) {
for window in line.chars().collect::<Vec<char>>().windows(patt.len()) {
let mut bar = String::new();
for ch in window {
bar.push(*ch);
}
// perform various comparison checks
}
}

An improvement on Shepmaster's final solution, which significantly lowers overhead (by a factor of ~1.5), is
fn foo(line: &str, pattern: &str) -> bool {
let pattern_len = pattern.chars().count();
let starts = line.char_indices().map(|(i, _)| i);
let mut ends = line.char_indices().map(|(i, _)| i);
// Itertools::dropping
if pattern_len != 0 { ends.nth(pattern_len - 1); }
for (start, end) in starts.zip(ends.chain(Some(line.len()))) {
let bar = &line[start..end];
if bar == pattern { return true }
}
false
}
That said, your code from the GitHub page is a little odd. For instance, you try to deal with different length open and close tags with a wordier version of
let length = cmp::max(comment.len(), comment_end.len());
but your check
if window.contains(comment)
could then trigger multiple times!
Much better would be to just iterate over shrinking slices. In the mini example this would be
fn foo(line: &str, pattern: &str) -> bool {
let mut chars = line.chars();
loop {
let bar = chars.as_str();
if bar.starts_with(pattern) { return true }
if chars.next().is_none() { break }
}
false
}
(Note that this once again improves performance, by another factor of ~1.5.)
In a larger example, this would be something like
let mut is_in_comments = 0u64;
let start = match line.find(comment) {
Some(start) => start,
None => return false,
};
let end = match line.rfind(comment_end) {
Some(end) => end,
None => return true,
};
let mut chars = line[start..end + comment_end.len()].chars();
loop {
let window = chars.as_str();
if window.starts_with(comment) {
if nested {
is_in_comments += 1;
} else {
is_in_comments = 1;
}
} else if window.starts_with(comment_end) {
is_in_comments = is_in_comments.saturating_sub(1);
}
if chars.next().is_none() { break }
}
Note that this still counts overlaps, so /*/ might count as an opening /* immediately followed by a closing */.
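One way to avoid counting overlaps is to jump past a delimiter once it has matched, instead of advancing by a single character. The following is a minimal sketch under some assumptions: the function name `comment_depth` and its signature are hypothetical, and it scans the whole line rather than the `find`/`rfind` range used above.

```rust
/// Returns the comment depth at the end of `line`.
/// Hypothetical helper: `open` and `close` play the roles of `comment`
/// and `comment_end` in the example above.
fn comment_depth(line: &str, open: &str, close: &str, nested: bool) -> u64 {
    let mut depth = 0u64;
    let mut rest = line;
    while !rest.is_empty() {
        if !open.is_empty() && rest.starts_with(open) {
            depth = if nested { depth + 1 } else { 1 };
            // `rest` starts with `open`, so `open.len()` is a char boundary.
            rest = &rest[open.len()..];
        } else if !close.is_empty() && rest.starts_with(close) {
            depth = depth.saturating_sub(1);
            rest = &rest[close.len()..];
        } else {
            // No delimiter here: advance by exactly one character.
            let mut chars = rest.chars();
            chars.next();
            rest = chars.as_str();
        }
    }
    depth
}
```

With this version, `/*/` is read as an opening `/*` followed by a stray `/`, so the comment stays open.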

The method needs to be guaranteed to be safe against Unicode.
pattern.len() returns the number of bytes that the string requires, so it's already possible that your code is doing the wrong thing. I might suggest you check out tools like QuickCheck to produce arbitrary strings that include Unicode.
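To make the byte/char distinction concrete, here is a small sketch (the helper name `byte_and_char_len` is made up for illustration) that returns both lengths side by side:

```rust
/// Returns (length in bytes, length in chars) for a string.
/// Hypothetical helper, only meant to show that the two can differ.
fn byte_and_char_len(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}
```

For ASCII input the two numbers agree, but for `"h\u{e9}llo"` (with a precomposed é) this gives 6 bytes and 5 chars, so a window sized with `len()` would span one char too many.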
Here's my test harness:
use std::iter;
fn main() {
let mut haystack: String = iter::repeat('a').take(1024*1024*100).collect();
haystack.push('b');
println!("{}", haystack.len());
}
And I'm compiling and timing via cargo build --release && time ./target/release/x. Creating the string by itself takes 0.274s.
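As an alternative to subtracting the setup time by hand, the function under test could be timed in-process. This is only a sketch using `std::time::Instant`; the helper name `timed` is an assumption and is not part of the harness above.

```rust
use std::time::{Duration, Instant};

/// Runs `f` once and returns its result together with the elapsed
/// wall-clock time. Hypothetical helper, not part of the original harness.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    (result, start.elapsed())
}
```

It would be used as `let (found, elapsed) = timed(|| foo(&haystack, "b"));`. A single run like this is noisy, so the numbers are best treated as rough.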
I used this version of your original code just to have some kind of comparison:
fn foo(line: &str, pattern: &str) -> bool {
for window in line.chars().collect::<Vec<char>>().windows(pattern.len()) {
let mut bar = String::new();
for ch in window {
bar.push(*ch);
}
if bar == pattern { return true }
}
false
}
This takes 4.565s, or 4.291s for just foo.
The first thing I see is that there is a lot of allocation happening on the inner loop. The code creates, allocates, and destroys the String for each iteration. Let's reuse the String allocation:
fn foo_mem(line: &str, pattern: &str) -> bool {
let mut bar = String::new();
for window in line.chars().collect::<Vec<char>>().windows(pattern.len()) {
bar.clear();
bar.extend(window.iter().cloned());
if bar == pattern { return true }
}
false
}
This takes 2.155s or 1.881s for just foo_mem.
Continuing on, another extraneous allocation is the one for the String at all. We already have bytes that look like the right thing, so let's reuse them:
fn foo_no_string(line: &str, pattern: &str) -> bool {
let indices: Vec<_> = line.char_indices().map(|(i, _c)| i).collect();
let l = pattern.chars().count();
for window in indices.windows(l + 1) {
let first_idx = *window.first().unwrap();
let last_idx = *window.last().unwrap();
let bar = &line[first_idx..last_idx];
if bar == pattern { return true }
}
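// Do the last pair; skipped when the pattern is empty or has more chars than the line
if let Some(&last_idx) = indices.len().checked_sub(l).and_then(|i| indices.get(i)) {
let bar = &line[last_idx..];
if bar == pattern { return true }
}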
false
}
This code is ugly and unidiomatic. I'm pretty sure some thinking (that I'm currently too lazy to do) would make it look a lot better.
This takes 1.409s or 1.135s for just foo_no_string.
As this is ~25% of the original time, Amdahl's Law suggests this is a reasonable stopping point.

Related

Does Rust have anything similar to CallerMemberName in C#?

I'd like to know the name of the function that called my function in Rust.
In C# there's the CallerMemberName attribute, which tells the compiler to replace the value of the string argument it's applied to with the name of the caller.
Does Rust have anything like that?
I don't know of a compile time solution, but you can use the backtrace functionality to resolve it at runtime.
use backtrace::Backtrace;
fn caller_name_slow() -> Option<String> {
let backtrace = Backtrace::new();
let symbolname = backtrace.frames().get(2)?.symbols().first()?.name();
symbolname.map(|s| format!("{:#?}", s))
}
fn caller_name_fast() -> Option<String> {
let mut count = 0;
let mut result = None;
backtrace::trace({
|frame| {
count += 1;
if count == 5 {
// Resolve this instruction pointer to a symbol name
backtrace::resolve_frame(frame, |symbol| {
if let Some(name) = symbol.name() {
result = Some(format!("{:#?}", name));
}
});
false
} else {
true // keep going to the next frame
}
}
});
result
}
fn my_function() {
println!("I got called by '{}'.", caller_name_slow().unwrap());
println!("I got called by '{}'.", caller_name_fast().unwrap());
}
fn main() {
my_function();
}
I got called by 'rust_tmp::main'.
I got called by 'rust_tmp::main'.
Note, however, that this is unreliable. The number of stack frames we have to go up differs between targets and between release and debug builds (due to inlining). For example, on my machine, in release I had to modify count == 5 to count == 2.
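If the caller's source location is enough and the function name itself is not required, the standard library's `#[track_caller]` attribute together with `std::panic::Location::caller()` is resolved at compile time. A minimal sketch follows (the helper name `caller_location` is hypothetical). Note that it reports file, line and column, not the name of the calling function.

```rust
use std::panic::Location;

/// Returns the source location of whoever called this function.
/// Hypothetical helper; `#[track_caller]` makes `Location::caller()`
/// refer to the call site instead of this function's own body.
#[track_caller]
fn caller_location() -> &'static Location<'static> {
    Location::caller()
}
```

A caller could print it with `println!("called from {}", caller_location());`, since `Location` implements `Display` as `file:line:column`.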

Consolidating multiple copies of a character at start of string into one in Rust

I'm working on a parser for a mini language, and I have the need to differentiate between plain strings ("hello") and strings that are meant to be operators/commands, and start with a specific sigil character (e.g. "$add").
I also want to add a way for the user to escape the sigil, in which a double-sigil gets consolidated into one, and then is treated like a plain string.
As an example:
"hello" becomes Str("hello")
"$add" becomes Operator(Op::Add)
"$$add" becomes Str("$add")
What would be the best way to do this check and manipulation? I was looking for a method that counts how many times a character appears at the start of a string, to no avail.
Can't you just use starts_with?
fn main() {
let line_list = ["hello", "$add", "$$add"];
let mut result;
for line in line_list.iter() {
if line.starts_with("$$") {
result = line[1..].to_string();
}
else if line.starts_with("$") {
result = format!("operator:{}", &line[1..]);
}
else {
result = line.to_string();
}
println!("result = {}", result);
}
}
Output
result = hello
result = operator:add
result = $add
According to the comments, your problem seems to be related to the access to the first chars.
The proper and efficient way is to get a char iterator:
#[derive(Debug)]
enum Token {
Str(String),
Operator(String),
}
impl From<&str> for Token {
fn from(s: &str) -> Self {
let mut chars = s.chars();
let first_char = chars.next();
let second_char = chars.next();
match (first_char, second_char) {
(Some('$'), Some('$')) => {
Token::Str(format!("${}", chars.as_str()))
}
(Some('$'), Some(c)) => {
// your real handling here is probably different
Token::Operator(format!("{}{}", c, chars.as_str()))
}
_ => {
Token::Str(s.to_string())
}
}
}
}
fn main() {
println!("{:?}", Token::from("π"));
println!("{:?}", Token::from("hello"));
println!("{:?}", Token::from("$add"));
println!("{:?}", Token::from("$$add"));
}
Result:
Str("π")
Str("hello")
Operator("add")
Str("$add")
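The question also asked for a way to count how many times a character appears at the start of a string. A small sketch of that (the function name `leading_count` is made up here) can be built from `chars()` and `take_while`:

```rust
/// Counts how many times `sigil` is repeated at the start of `s`.
/// Hypothetical helper; it walks chars, so it is safe for non-ASCII sigils.
fn leading_count(s: &str, sigil: char) -> usize {
    s.chars().take_while(|&c| c == sigil).count()
}
```

For example, `leading_count("$$add", '$')` is 2 and `leading_count("hello", '$')` is 0.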

How to implement a lightweight long-lived thread based on a generator or asynchronous function in Rust?

I want to implement a user interaction script in the form of a lightweight, long-lived thread written in Rust. Inside the script, I have points where I asynchronously await user input.
In JavaScript, I would use a generator, inside which you can pass a question, and get back an answer, for example:
function* my_scenario() {
yield "What is your name?";
let my_name = yield "How are you feeling?";
let my_mood = yield "";
...
}
let my_session = my_scenario();
...
my_session.next("Peter");
my_session.next("happy");
However, Rust's generator method resume() takes no parameters! I cannot clone a generator or return it from a function in order to have many user sessions with different states. Instead of a generator, I thought of using an async fn(), but I do not understand how to call it at each step, passing the value there.
The return value from yield is effectively just another generator that has been implicitly passed to the first generator, except that it forces the two to be tied together in weird ways.
You can see that in your original code by the junk yield "" that you need in order to get a value even though you don't have anything to return. Additionally, your example requires that the user of the generator know the answer to the question before it is asked, which seems very unorthodox.
Explicitly pass in a second generator:
#![feature(generators, generator_trait)]
use std::{
io,
ops::{Generator, GeneratorState},
};
fn user_input() -> impl Generator<Yield = String> {
|| {
let input = io::stdin();
loop {
let mut line = String::new();
input.read_line(&mut line).unwrap();
yield line;
}
}
}
fn my_scenario(
input: impl Generator<Yield = String>,
) -> impl Generator<Yield = &'static str, Return = String> {
|| {
let mut input = Box::pin(input);
yield "What is your name?";
let my_name = match input.as_mut().resume(()) {
GeneratorState::Yielded(v) => v,
GeneratorState::Complete(_) => panic!("input did not return a value"),
};
yield "How are you feeling?";
let my_mood = match input.as_mut().resume(()) {
GeneratorState::Yielded(v) => v,
GeneratorState::Complete(_) => panic!("input did not return a value"),
};
format!("{} is {}", my_name.trim(), my_mood.trim())
}
}
fn main() {
let my_session = my_scenario(user_input());
let mut my_session = Box::pin(my_session);
loop {
match my_session.as_mut().resume(()) {
GeneratorState::Yielded(prompt) => {
println!("{}", prompt);
}
GeneratorState::Complete(v) => {
println!("{}", v);
break;
}
}
}
}
$ cargo run
What is your name?
Shep
How are you feeling?
OK
Shep is OK
You can provide hard-coded data as well:
let user_input = || {
yield "Peter".to_string();
yield "happy".to_string();
};
let my_session = my_scenario(user_input);
As of approximately Rust nightly 2020-02-08, Rust's generators now accept an argument to resume, more closely matching the original JavaScript example:
#![feature(generators, generator_trait)]
use std::{
io::{self, BufRead},
ops::{Generator, GeneratorState},
};
fn my_scenario() -> impl Generator<String, Yield = &'static str, Return = String> {
|_arg: String| {
let my_name = yield "What is your name?";
let my_mood = yield "How are you feeling?";
format!("{} is {}", my_name.trim(), my_mood.trim())
}
}
fn main() {
let my_session = my_scenario();
let mut my_session = Box::pin(my_session);
let stdin = io::stdin();
let mut lines = stdin.lock().lines();
let mut line = String::new();
loop {
match my_session.as_mut().resume(line) {
GeneratorState::Yielded(prompt) => {
println!("{}", prompt);
}
GeneratorState::Complete(v) => {
println!("{}", v);
break;
}
}
line = lines.next().expect("User input ended").expect("User input malformed");
}
}
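Both versions above need a nightly compiler. On stable Rust, the same question-and-answer flow can be approximated with a hand-written state machine, which is roughly the shape a generator takes after compilation. This is a sketch of that alternative, not the generator API itself; the names `Scenario` and `Step` are hypothetical.

```rust
/// What the scenario produces on each step (hypothetical type).
#[derive(Debug, PartialEq)]
enum Step {
    Ask(&'static str),
    Done(String),
}

/// Hand-written state machine standing in for the generator.
enum Scenario {
    Start,
    AwaitingName,
    AwaitingMood { name: String },
    Finished,
}

impl Scenario {
    fn new() -> Self {
        Scenario::Start
    }

    /// Feeds the answer to the previous question and returns the next step.
    /// As with the generator, the input to the first call is ignored.
    fn resume(&mut self, input: String) -> Step {
        // Take the current state out, leaving `Finished` behind for now.
        match std::mem::replace(self, Scenario::Finished) {
            Scenario::Start => {
                *self = Scenario::AwaitingName;
                Step::Ask("What is your name?")
            }
            Scenario::AwaitingName => {
                *self = Scenario::AwaitingMood { name: input };
                Step::Ask("How are you feeling?")
            }
            Scenario::AwaitingMood { name } => {
                Step::Done(format!("{} is {}", name.trim(), input.trim()))
            }
            Scenario::Finished => panic!("scenario resumed after completion"),
        }
    }
}
```

Because `Scenario` is an ordinary value, it can be returned from functions and stored per user session, which addresses the concern raised in the question.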

What is an efficient way of parsing a string of single-character commands, each optionally followed by an integer repeat count?

I am implementing a robot that takes orders like L (turn left), R (turn right) and M (move forward). These orders may be augmented with a quantifier like M3LMR2 (move 3 steps, turn left, move one step, turn right twice). This is the equivalent of MMMLMRR.
I coded the robot structure that can understand the following enum:
pub enum Message {
TurnLeft(i8),
TurnRight(i8),
MoveForward(i8),
}
Robot::execute(&mut self, orders: Vec<Message>) is doing its job correctly.
Now, I am struggling to write something decent for the string parsing, juggling with &str, String, char and unsafe slicings because tokens can be 1 or more characters.
I have tried regular expression matching (almost worked), but I really want to tokenize the string:
fn capture(orders: &String, start: &usize, end: &usize) -> Message {
unsafe {
let order = orders.get_unchecked(start..end);
// …
};
Message::TurnLeft(1) // temporary
}
pub fn parse_orders(orders: String) -> Result<Vec<Message>, String> {
let mut messages = vec![];
let mut start: usize = 0;
let mut end: usize = 0;
while end < orders.len() && end != start {
end += 1;
match orders.get(end) {
Some('0'...'9') => continue,
_ => {
messages.push(capture(&orders, &start, &end));
start = end;
}
}
}
Ok(messages)
}
This doesn't compile and is clumsy.
The idea is to write a parser that turns the order string into a vector of Message:
let messages = parse_orders("M3LMR2");
println!("Messages => {:?}", messages);
// would print
// [Message::MoveForward(3), Message::TurnLeft(1), Message::MoveForward(1), Message::TurnRight(2)]
What would be the efficient/elegant way for doing that?
You can do this very simply with an iterator, using parse and some basic String processing:
#[derive(Debug, PartialEq, Clone)]
enum Message {
TurnLeft(u8),
TurnRight(u8),
MoveForward(u8),
}
struct RobotOrders(String);
impl RobotOrders {
fn new(source: impl Into<String>) -> Self {
RobotOrders(source.into())
}
}
impl Iterator for RobotOrders {
type Item = Message;
fn next(&mut self) -> Option<Message> {
self.0.chars().next()?;
let order = self.0.remove(0);
let n_digits = self.0.chars().take_while(char::is_ascii_digit).count();
let mut number = self.0.clone();
self.0 = number.split_off(n_digits);
let number = number.parse().unwrap_or(1);
Some(match order {
'L' => Message::TurnLeft(number),
'R' => Message::TurnRight(number),
'M' => Message::MoveForward(number),
_ => unimplemented!(),
})
}
}
fn main() {
use Message::*;
let orders = RobotOrders::new("M3LMR2");
let should_be = [MoveForward(3), TurnLeft(1), MoveForward(1), TurnRight(2)];
assert!(orders.eq(should_be.iter().cloned()));
}
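The iterator above clones the remaining string on every step and panics on unknown letters. As a variation, here is a sketch that walks the input once with a peekable char iterator and reports errors through `Result`. The `parse_orders` signature is adapted from the question to take `&str`, and the error messages are made up for illustration.

```rust
#[derive(Debug, PartialEq, Clone)]
enum Message {
    TurnLeft(u8),
    TurnRight(u8),
    MoveForward(u8),
}

/// Parses orders such as "M3LMR2" in a single pass.
/// Sketch only: a missing repeat count defaults to 1.
fn parse_orders(orders: &str) -> Result<Vec<Message>, String> {
    let mut messages = Vec::new();
    let mut chars = orders.chars().peekable();
    while let Some(order) = chars.next() {
        // Collect the digits that follow the command letter, if any.
        let mut digits = String::new();
        while let Some(&c) = chars.peek() {
            if !c.is_ascii_digit() {
                break;
            }
            digits.push(c);
            chars.next();
        }
        let count = if digits.is_empty() {
            1
        } else {
            digits
                .parse::<u8>()
                .map_err(|e| format!("bad repeat count {:?}: {}", digits, e))?
        };
        messages.push(match order {
            'L' => Message::TurnLeft(count),
            'R' => Message::TurnRight(count),
            'M' => Message::MoveForward(count),
            other => return Err(format!("unknown order {:?}", other)),
        });
    }
    Ok(messages)
}
```

A count that does not fit in `u8`, such as `M999`, is reported as an error rather than silently replaced by 1.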

Possible to combine assignment and comparison in an expression?

In C, it's common to assign and compare in a single expression:
n = n_init;
do {
func(n);
} while ((n = n.next) != n_init);
As I understand it this can be expressed in Rust as:
n = n_init;
loop {
func(n);
n = n.next;
if n == n_init {
break;
}
}
Which works the same as the C version (assuming the body of the loop doesn't use continue).
Is there a more terse way to express this in Rust, or is the example above ideal?
For the purposes of this question, assume ownership or satisfying the borrow checker isn't an issue. It's up to the developer to satisfy these requirements.
For example, as an integer:
n = n_init;
loop {
func(&vec[n]);
n = vec[n].next;
if n == n_init {
break;
}
}
It may seem obvious that the Rust example above is idiomatic Rust. However, I'm looking to move quite a lot of this style of loop to Rust, so I'm interested to know if there is a better or different way to express it.
The idiomatic way to represent iteration in Rust is to use an Iterator. Thus you would implement an iterator that does the n = n.next and then use a for loop to iterate over the iterator.
struct MyIter<'a> {
pos: &'a MyData,
start: &'a MyData,
}
impl<'a> Iterator for MyIter<'a> {
type Item = &'a MyData;
fn next(&mut self) -> Option<&'a MyData> {
if self.pos as *const _ == self.start as *const _ {
None
} else {
let pos = self.pos;
self.pos = self.pos.next;
Some(pos)
}
}
}
It is left as an exercise to the reader to adapt this iterator so that it starts from the first element instead of the second.
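For the index-based variant from the question, one possible shape of such an iterator, starting from the first element, is sketched below. The `Node` and `Cycle` types are hypothetical, and the sketch assumes that following `next` eventually leads back to `start`; otherwise it does not terminate.

```rust
/// Hypothetical node type: `next` is an index into the same slice.
struct Node {
    value: i32,
    next: usize,
}

/// Walks a circular chain of nodes once, beginning at `start`.
struct Cycle<'a> {
    nodes: &'a [Node],
    start: usize,
    pos: Option<usize>,
}

impl<'a> Cycle<'a> {
    fn new(nodes: &'a [Node], start: usize) -> Self {
        Cycle { nodes, start, pos: Some(start) }
    }
}

impl<'a> Iterator for Cycle<'a> {
    type Item = &'a Node;

    fn next(&mut self) -> Option<&'a Node> {
        let pos = self.pos?;
        let nodes: &'a [Node] = self.nodes;
        // An out-of-range index simply ends the iteration.
        let node = nodes.get(pos)?;
        // Stop after the node whose successor is the starting node.
        self.pos = if node.next == self.start { None } else { Some(node.next) };
        Some(node)
    }
}
```

The loop from the question then becomes `for node in Cycle::new(&vec, n_init) { func(node); }`.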
Rust supports pattern matching in if and while (as if let and while let). Instead of having a boolean condition, the test is considered successful if the pattern matches, and as part of pattern matching you bind the matched values to names.
Thus, if instead of having a boolean condition you were building an Option...
fn check(next: *mut Node, init: *mut Node) -> Option<*mut Node>;
let mut n = n_init;
loop {
func(n);
if let Some(x) = check(n.next, n_init) {
n = x;
} else {
break;
}
}
However, if you can use an Iterator instead you'll be much more idiomatic.
An assignment in Rust returns the empty tuple. If you are fine with non-idiomatic code you can compare the assignment-result with such an empty tuple and use a logical conjunction to chain your actual loop condition.
let mut current = 3;
let mut parent;
while (parent = get_parent(current)) == () && parent != current {
println!("currently {}, parent is {}", current, parent);
current = parent;
}
// example function
fn get_parent(x: usize) -> usize {
if x > 0 { x - 1 } else { x }
}
// currently 3, parent is 2
// currently 2, parent is 1
// currently 1, parent is 0
This has the disadvantage that entering the loop needs to run logic (which you can avoid with C's do {..} while(); style loops).
You can use this approach inside a do-while macro, but readability isn't that great and at that point a refactoring might be preferable. In any case, this is how it could look:
do_it!({
println!("{}", n);
} while (n = n + 1) == () && n < 4);
This is the code for the macro:
macro_rules! do_it {
($b: block while $e:expr) => {
loop {
$b
if !($e) { break };
}
}
}
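For comparison, the parent walk from this answer can also be written without comparing an assignment to `()`, using `std::iter::successors`. This is a sketch of an alternative, not the approach of the answer above; `get_parent` is copied from the example, and the name `ancestors` is hypothetical.

```rust
use std::iter::successors;

// Example function, copied from the answer above.
fn get_parent(x: usize) -> usize {
    if x > 0 { x - 1 } else { x }
}

/// Yields `start`, then its parent, and so on, stopping once a node
/// is its own parent.
fn ancestors(start: usize) -> impl Iterator<Item = usize> {
    successors(Some(start), |&current| {
        let parent = get_parent(current);
        if parent != current { Some(parent) } else { None }
    })
}
```

Starting from 3, this yields 3, 2, 1 and 0, the same nodes the while loop visits.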
