How can I convert this loop based implementation to iteration syntax?
fn parse_number<B: AsRef<str>>(input: B) -> Option<u32> {
let mut started = false;
let mut b = String::with_capacity(50);
let radix = 16;
for c in input.as_ref().chars() {
match (started, c.is_digit(radix)) {
(false, false) => {},
(false, true) => {
started = true;
b.push(c);
},
(true, false) => {
break;
}
(true, true) => {
b.push(c);
},
}
}
if b.len() == 0 {
None
} else {
match u32::from_str_radix(b.as_str(), radix) {
Ok(v) => Some(v),
Err(_) => None,
}
}
}
The main problem that I found is that you need to terminate the iterator early and be able to ignore characters until the first numeric char is found.
.map_while() fails because it has no state.
.reduce() and .fold() would iterate over the entire str regardless if the number has already ended.
It looks like you want to find the first sequence of digits while ignoring any non-digits before that. You can use a combination of .skip_while and .take_while:
fn parse_number<B: AsRef<str>>(input: B) -> Option<u32> {
let input = input.as_ref();
let radix = 10;
let digits: String = input.chars()
.skip_while(|c| !c.is_digit(radix))
.take_while(|c| c.is_digit(radix))
.collect();
u32::from_str_radix(&digits, radix).ok()
}
fn main() {
dbg!(parse_number("I have 52 apples"));
}
[src/main.rs:14] parse_number("I have 52 apples") = Some(
52,
)
Related
Problem description
Using serde_json to deserialize a very long array of objects into a Vec<T> can take a long time, because the entire array must be read into memory up front. I'd like to iterate over the items in the array instead to avoid the up-front processing and memory requirements.
My approach so far
StreamDeserializer cannot be used directly, because it can only iterate over self-delimiting types placed back-to-back. So what I've done so far is to write a custom struct to implement Read, wrapping another Read but omitting the starting and ending square brackets, as well as any commas.
For example, the reader will transform the JSON [{"name": "foo"}, {"name": "bar"}, {"name": "baz"}] into {"name": "foo"} {"name": "bar"} {"name": "baz"} so it can be used with StreamDeserializer.
Here is the code in its entirety:
use std::io;
/// An implementation of `Read` that transforms JSON input where the outermost
/// structure is an array. The enclosing brackets and commas are removed,
/// causing the items to be adjacent to one another. This works with
/// [`serde_json::StreamDeserializer`].
pub(crate) struct ArrayStreamReader<T> {
inner: T,
depth: Option<usize>,
inside_string: bool,
escape_next: bool,
}
impl<T: io::Read> ArrayStreamReader<T> {
pub(crate) fn new_buffered(inner: T) -> io::BufReader<Self> {
io::BufReader::new(ArrayStreamReader {
inner,
depth: None,
inside_string: false,
escape_next: false,
})
}
}
#[inline]
fn do_copy(dst: &mut [u8], src: &[u8], len: usize) {
if len == 1 {
dst[0] = src[0]; // Avoids memcpy call.
} else {
dst[..len].copy_from_slice(&src[..len]);
}
}
impl<T: io::Read> io::Read for ArrayStreamReader<T> {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
if buf.is_empty() {
return Ok(0);
}
let mut tmp = vec![0u8; buf.len()];
// The outer loop is here in case every byte was skipped, which can happen
// easily if `buf.len()` is 1. In this situation, the operation is retried
// until either no bytes are read from the inner stream, or at least 1 byte
// is written to `buf`.
loop {
let byte_count = self.inner.read(&mut tmp)?;
if byte_count == 0 {
return if self.depth.is_some() {
Err(io::ErrorKind::UnexpectedEof.into())
} else {
Ok(0)
};
}
let mut tmp_pos = 0;
let mut buf_pos = 0;
for (i, b) in tmp.iter().cloned().enumerate() {
if self.depth.is_none() {
match b {
b'[' => {
tmp_pos = i + 1;
self.depth = Some(0);
},
b if b.is_ascii_whitespace() => {},
b'\0' => break,
_ => return Err(io::ErrorKind::InvalidData.into()),
}
continue;
}
if self.inside_string {
match b {
_ if self.escape_next => self.escape_next = false,
b'\\' => self.escape_next = true,
b'"' if !self.escape_next => self.inside_string = false,
_ => {},
}
continue;
}
let depth = self.depth.unwrap();
match b {
b'[' | b'{' => self.depth = Some(depth + 1),
b']' | b'}' if depth > 0 => self.depth = Some(depth - 1),
b'"' => self.inside_string = true,
b'}' if depth == 0 => return Err(io::ErrorKind::InvalidData.into()),
b',' | b']' if depth == 0 => {
let len = i - tmp_pos;
do_copy(&mut buf[buf_pos..], &tmp[tmp_pos..], len);
tmp_pos = i + 1;
buf_pos += len;
// Then write a space to separate items.
buf[buf_pos] = b' ';
buf_pos += 1;
if b == b']' {
// Reached the end of outer array. If another array
// follows, the stream will continue.
self.depth = None;
}
},
_ => {},
}
}
if tmp_pos < byte_count {
let len = byte_count - tmp_pos;
do_copy(&mut buf[buf_pos..], &tmp[tmp_pos..], len);
buf_pos += len;
}
if buf_pos > 0 {
// If at least some data was read, return with the amount. Otherwise, the outer
// loop will try again.
return Ok(buf_pos);
}
}
}
}
It is used like so:
use std::io;
use serde::Deserialize;
#[derive(Deserialize)]
struct Item {
name: String,
}
fn main() -> io::Result<()> {
let json = br#"[{"name": "foo"}, {"name": "bar"}]"#;
let wrapped = ArrayStreamReader::new_buffered(&json[..]);
let first_item: Item = serde_json::Deserializer::from_reader(wrapped)
.into_iter()
.next()
.unwrap()?;
assert_eq!(first_item.name, "foo");
Ok(())
}
At last, a question
There must be a better way to do this, right?
I copied this code from Code Review into IntelliJ IDEA to try and play around with it. I have a homework assignment that is similar to this one (I need to write a version of Linux's bc in Rust), so I am using this code only for reference purposes.
use std::io;
extern crate regex;
#[macro_use]
extern crate lazy_static;
use regex::Regex;
fn main() {
let tokenizer = Tokenizer::new();
loop {
println!("Enter input:");
let mut input = String::new();
io::stdin()
.read_line(&mut input)
.expect("Failed to read line");
let tokens = tokenizer.tokenize(&input);
let stack = shunt(tokens);
let res = calculate(stack);
println!("{}", res);
}
}
#[derive(Debug, PartialEq)]
enum Token {
Number(i64),
Plus,
Sub,
Mul,
Div,
LeftParen,
RightParen,
}
impl Token {
/// Returns the precedence of op
fn precedence(&self) -> usize {
match *self {
Token::Plus | Token::Sub => 1,
Token::Mul | Token::Div => 2,
_ => 0,
}
}
}
struct Tokenizer {
number: Regex,
}
impl Tokenizer {
fn new() -> Tokenizer {
Tokenizer {
number: Regex::new(r"^[0-9]+").expect("Unable to create the regex"),
}
}
/// Tokenizes the input string into a Vec of Tokens.
fn tokenize(&self, mut input: &str) -> Vec<Token> {
let mut res = vec![];
loop {
input = input.trim_left();
if input.is_empty() { break }
let (token, rest) = match self.number.find(input) {
Some((_, end)) => {
let (num, rest) = input.split_at(end);
(Token::Number(num.parse().unwrap()), rest)
},
_ => {
match input.chars().next() {
Some(chr) => {
(match chr {
'+' => Token::Plus,
'-' => Token::Sub,
'*' => Token::Mul,
'/' => Token::Div,
'(' => Token::LeftParen,
')' => Token::RightParen,
_ => panic!("Unknown character!"),
}, &input[chr.len_utf8()..])
}
None => panic!("Ran out of input"),
}
}
};
res.push(token);
input = rest;
}
res
}
}
/// Transforms the tokens created by `tokenize` into RPN using the
/// [Shunting-yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm)
fn shunt(tokens: Vec<Token>) -> Vec<Token> {
let mut queue = vec![];
let mut stack: Vec<Token> = vec![];
for token in tokens {
match token {
Token::Number(_) => queue.push(token),
Token::Plus | Token::Sub | Token::Mul | Token::Div => {
while let Some(o) = stack.pop() {
if token.precedence() <= o.precedence() {
queue.push(o);
} else {
stack.push(o);
break;
}
}
stack.push(token)
},
Token::LeftParen => stack.push(token),
Token::RightParen => {
let mut found_paren = false;
while let Some(op) = stack.pop() {
match op {
Token::LeftParen => {
found_paren = true;
break;
},
_ => queue.push(op),
}
}
assert!(found_paren)
},
}
}
while let Some(op) = stack.pop() {
queue.push(op);
}
queue
}
/// Takes a Vec of Tokens converted to RPN by `shunt` and calculates the result
fn calculate(tokens: Vec<Token>) -> i64 {
let mut stack = vec![];
for token in tokens {
match token {
Token::Number(n) => stack.push(n),
Token::Plus => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a + b);
},
Token::Sub => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a - b);
},
Token::Mul => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a * b);
},
Token::Div => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a / b);
},
_ => {
// By the time the token stream gets here, all the LeftParen
// and RightParen tokens will have been removed by shunt()
unreachable!();
},
}
}
stack[0]
}
When I run it, however, it gives me this error:
error[E0308]: mismatched types
--> src\main.rs:66:22
|
66 | Some((_, end)) => {
| ^^^^^^^^ expected struct `regex::Match`, found tuple
|
= note: expected type `regex::Match<'_>`
found type `(_, _)`
It's complaining that I am using a tuple for the Some() method when I am supposed to use a token. I am not sure what to pass for the token, because it appears that the tuple is traversing through the Token options. How do I re-write this to make the Some() method recognize the tuple as a Token? I have been working on this for a day but I have not found any really good solutions.
The code you are referencing is over two years old. Notably, that predates regex 1.0. Version 0.1.80 defines Regex::find as:
fn find(&self, text: &str) -> Option<(usize, usize)>
while version 1.0.6 defines it as:
pub fn find<'t>(&self, text: &'t str) -> Option<Match<'t>>
However, Match defines methods to get the starting and ending indices the code was written assuming. In this case, since you only care about the end index, you can call Match::end:
let (token, rest) = match self.number.find(input).map(|x| x.end()) {
Some(end) => {
// ...
I am trying get something like this (doesn't work):
match input {
"next" => current_question_number += 1,
"prev" => current_question_number -= 1,
"goto {x}" => current_question_number = x,
// ...
_ => status = "Unknown Command".to_owned()
}
I tried two different versions of Regex:
go_match = regex::Regex::new(r"goto (\d+)?").unwrap();
// ...
match input {
...
x if go_match.is_match(x) => current_question_number = go_match.captures(x).unwrap().get(1).unwrap().as_str().parse().unwrap(),
_ => status = "Unknown Command".to_owned()
}
and
let cmd_match = regex::Regex::new(r"([a-zA-Z]+) (\d+)?").unwrap();
// ...
if let Some(captures) = cmd_match.captures(input.as_ref()) {
let cmd = captures.get(1).unwrap().as_str().to_lowercase();
if let Some(param) = captures.get(2) {
let param = param.as_str().parse().unwrap();
match cmd.as_ref() {
"goto" => current_question_number = param,
}
} else {
match cmd.as_ref() {
"next" => current_question_number += 1,
"prev" => current_question_number -= 1,
}
}
} else {
status = "Unknown Command".to_owned();
}
Both seem like a ridiculously long and and complicated way to do something pretty common, am I missing something?
You can create a master Regex that captures all the interesting components then build a Vec of all the captured pieces. This Vec can then be matched against:
extern crate regex;
use regex::Regex;
fn main() {
let input = "goto 4";
let mut current_question_number = 0;
// Create a regex that matches on the union of all commands
// Each command and argument is captured
// Using the "extended mode" flag to write a nicer Regex
let input_re = Regex::new(
r#"(?x)
(next) |
(prev) |
(goto)\s+(\d+)
"#
).unwrap();
// Execute the Regex
let captures = input_re.captures(input).map(|captures| {
captures
.iter() // All the captured groups
.skip(1) // Skipping the complete match
.flat_map(|c| c) // Ignoring all empty optional matches
.map(|c| c.as_str()) // Grab the original strings
.collect::<Vec<_>>() // Create a vector
});
// Match against the captured values as a slice
match captures.as_ref().map(|c| c.as_slice()) {
Some(["next"]) => current_question_number += 1,
Some(["prev"]) => current_question_number -= 1,
Some(["goto", x]) => {
let x = x.parse().expect("can't parse number");
current_question_number = x;
}
_ => panic!("Unknown Command: {}", input),
}
println!("Now at question {}", current_question_number);
}
You have a mini language for picking questions:
pick the next question
pick the prev question
goto a specific question
If your requirements end here a Regex based solution fits perfectly.
If your DSL may evolve a parser based solution is worth considering.
The parser combinator nom is a powerful tool to build a grammar starting from basic elements.
Your language has these characteristics:
it has three alternatives statements (alt!): next, prev, goto \d+
the most complex statement "goto {number}" is composed of the keyword (tag!) goto in front of (preceded!) a number (digit!).
any numbers of whitespaces (ws!) has to be ignored
These requirements translate in this implementation:
#[macro_use]
extern crate nom;
use nom::{IResult, digit};
use nom::types::CompleteStr;
// we have for now two types of outcome: absolute or relative cursor move
pub enum QMove {
Abs(i32),
Rel(i32)
}
pub fn question_picker(input: CompleteStr) -> IResult<CompleteStr, QMove> {
ws!(input,
alt!(
map!(
tag!("next"),
|_| QMove::Rel(1)
) |
map!(
tag!("prev"),
|_| QMove::Rel(-1)
) |
preceded!(
tag!("goto"),
map!(
digit,
|s| QMove::Abs(std::str::FromStr::from_str(s.0).unwrap())
)
)
)
)
}
fn main() {
let mut current_question_number = 60;
let first_line = "goto 5";
let outcome = question_picker(CompleteStr(first_line));
match outcome {
Ok((_, QMove::Abs(n))) => current_question_number = n,
Ok((_, QMove::Rel(n))) => current_question_number += n,
Err(err) => {panic!("error: {:?}", err)}
}
println!("Now at question {}", current_question_number);
}
You can use str::split for this (playground)
fn run(input: &str) {
let mut toks = input.split(' ').fuse();
let first = toks.next();
let second = toks.next();
match first {
Some("next") => println!("next found"),
Some("prev") => println!("prev found"),
Some("goto") => match second {
Some(num) => println!("found goto with number {}", num),
_ => println!("goto with no parameter"),
},
_ => println!("invalid input {:?}", input),
}
}
fn main() {
run("next");
run("prev");
run("goto 10");
run("this is not valid");
run("goto"); // also not valid but for a different reason
}
will output
next found
prev found
found goto with number 10
invalid input "this is not valid"
goto with no parameter
I want to create a substring in Rust. It starts with an occurrence of a string and ends at the end of the string minus four characters or at a certain character.
My first approach was
string[string.find("pattern").unwrap()..string.len()-5]
That is wrong because Rust's strings are valid UTF-8 and thus byte and not char based.
My second approach is correct but too verbose:
let start_bytes = string.find("pattern").unwrap();
let mut char_byte_counter = 0;
let result = line.chars()
.skip_while(|c| {
char_byte_counter += c.len_utf8();
return start_bytes > char_byte_counter;
})
.take_while(|c| *c != '<')
.collect::<String>();
Are there simpler ways to create substrings? Is there any part of the standard library I did not find?
I don't remember a built-in library function in other languages that works exactly the way you want (give me the substring between two patterns, or between the first and the end if the second does not exist).
I think you would have to write some custom logic anyway.
The closest equivalent to a "substring" function is slicing. However (as you found out) it works with bytes, not with unicode characters, so you will have to be careful with indices. In "Löwe", the 'e' is at (byte) index 4, not 3 (playground). But you can still use it in your case, because you are not working with indices directly (using find instead to... find the index you need for you)
Here's how you could do it with slicing (bonus, you don't need to re-allocate other Strings):
// adding some unicode to check that everything works
// also ouside of ASCII
let line = "asdfapatterndf1老虎23<12";
let start_bytes = line.find("pattern").unwrap_or(0); //index where "pattern" starts
// or beginning of line if
// "pattern" not found
let end_bytes = line.find("<").unwrap_or(line.len()); //index where "<" is found
// or end of line
let result = &line[start_bytes..end_bytes]; //slicing line, returns patterndf1老虎23
Try using something like the following method:
//Return result in &str or empty &str if not found
fn between<'a>(source: &'a str, start: &'a str, end: &'a str) -> &'a str {
let start_position = source.find(start);
if start_position.is_some() {
let start_position = start_position.unwrap() + start.len();
let source = &source[start_position..];
let end_position = source.find(end).unwrap_or_default();
return &source[..end_position];
}
return "";
}
This method approximate to O(n) with char and grapheme in mind. It works, but I'm not sure if there are any bugs.
fn between(str: &String, start: String, end: String, limit_one:bool, ignore_case: bool) -> Vec<String> {
let mut result:Vec<String> = vec![];
let mut starts = start.graphemes(true);
let mut ends = end.graphemes(true);
let sc = start.graphemes(true).count();
let ec = end.graphemes(true).count();
let mut m = 0;
let mut started:bool = false;
let mut temp = String::from("");
let mut temp2 = String::from("");
for c in str.graphemes(true) {
if started == false {
let opt = starts.next();
match opt {
Some(d) => {
if (ignore_case && c.to_uppercase().cmp(&d.to_uppercase()) == std::cmp::Ordering::Equal) || c == d {
m += 1;
if m == sc {
started = true;
starts = start.graphemes(true);
}
} else {
m = 0;
starts = start.graphemes(true);
}
},
None => {
starts = start.graphemes(true);
let opt = starts.next();
match opt {
Some(e) => {
if (ignore_case && c.to_uppercase().cmp(&e.to_uppercase()) == std::cmp::Ordering::Equal) || c == e {
m += 1;
if m == sc {
started = true;
starts = start.graphemes(true);
}
}
},
None => {}
}
}
}
}
else if started == true {
let opt = ends.next();
match opt {
Some(e) => {
if (ignore_case && c.to_uppercase().cmp(&e.to_uppercase()) == std::cmp::Ordering::Equal) || c == e {
m += 1;
temp2.push_str(e);
}
else {
temp.push_str(&temp2.to_string());
temp2 = String::from("") ;
temp.push_str(c);
ends = end.graphemes(true);
}
},
None => {
ends = end.graphemes(true);
let opt = ends.next();
match opt {
Some(e) => {
if (ignore_case && c.to_uppercase().cmp(&e.to_uppercase()) == std::cmp::Ordering::Equal) || c == e {
m += 1;
temp2.push_str(e);
}
else {
temp.push_str(&temp2.to_string());
temp2 = String::from("") ;
temp.push_str(c);
ends = end.graphemes(true);
}
},
None => {
}
}
}
}
if temp2.graphemes(true).count() == end.graphemes(true).count() {
temp2 = String::from("") ;
result.push(temp);
if limit_one == true { return result; }
started = false;
temp = String::from("") ;
}
}
}
return result;
}
Can you put another match clause in one of the match results of a match like this in:
pub fn is_it_file(input_file: &str) -> String {
let path3 = Path::new(input_file);
match path3.is_file() {
true => "File!".to_string(),
false => match path3.is_dir() {
true => "Dir!".to_string(),
_ => "Don't care",
}
}
}
If not why ?
Yes you can (see Qantas' answer). But Rust often has prettier ways to do what you want. You can do multiple matches at once by using tuples.
pub fn is_it_file(input_file: &str) -> String {
let path3 = Path::new(input_file);
match (path3.is_file(), path3.is_dir()) {
(true, false) => "File!",
(false, true) => "Dir!",
_ => "Neither or Both... bug?",
}.to_string()
}
Sure you can, match is an expression:
fn main() {
fn foo() -> i8 {
let a = true;
let b = false;
match a {
true => match b {
true => 1,
false => 2
},
false => 3
}
}
println!("{}", foo()); // 2
}
You can view the results of this on the Rust playpen.
The only thing that seems off about your code to me is the inconsistent usage of .to_string() in your code, the last match case doesn't have that.