How to lazily deserialize from a JSON array?

Problem description
Using serde_json to deserialize a very long array of objects into a Vec<T> can take a long time, because the entire array must be read into memory up front. I'd like to iterate over the items in the array instead to avoid the up-front processing and memory requirements.
My approach so far
StreamDeserializer cannot be used directly, because it can only iterate over self-delimiting types placed back-to-back. So what I've done so far is to write a custom struct to implement Read, wrapping another Read but omitting the starting and ending square brackets, as well as any commas.
For example, the reader will transform the JSON [{"name": "foo"}, {"name": "bar"}, {"name": "baz"}] into {"name": "foo"} {"name": "bar"} {"name": "baz"} so it can be used with StreamDeserializer.
Here is the code in its entirety:
use std::io;
/// An implementation of `Read` that transforms JSON input where the outermost
/// structure is an array. The enclosing brackets and commas are removed,
/// causing the items to be adjacent to one another. This works with
/// [`serde_json::StreamDeserializer`].
pub(crate) struct ArrayStreamReader<T> {
inner: T,
depth: Option<usize>,
inside_string: bool,
escape_next: bool,
}
impl<T: io::Read> ArrayStreamReader<T> {
pub(crate) fn new_buffered(inner: T) -> io::BufReader<Self> {
io::BufReader::new(ArrayStreamReader {
inner,
depth: None,
inside_string: false,
escape_next: false,
})
}
}
#[inline]
fn do_copy(dst: &mut [u8], src: &[u8], len: usize) {
if len == 1 {
dst[0] = src[0]; // Avoids memcpy call.
} else {
dst[..len].copy_from_slice(&src[..len]);
}
}
impl<T: io::Read> io::Read for ArrayStreamReader<T> {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
if buf.is_empty() {
return Ok(0);
}
let mut tmp = vec![0u8; buf.len()];
// The outer loop is here in case every byte was skipped, which can happen
// easily if `buf.len()` is 1. In this situation, the operation is retried
// until either no bytes are read from the inner stream, or at least 1 byte
// is written to `buf`.
loop {
let byte_count = self.inner.read(&mut tmp)?;
if byte_count == 0 {
return if self.depth.is_some() {
Err(io::ErrorKind::UnexpectedEof.into())
} else {
Ok(0)
};
}
let mut tmp_pos = 0;
let mut buf_pos = 0;
// Only the first `byte_count` bytes of `tmp` hold data from this read.
for (i, b) in tmp[..byte_count].iter().cloned().enumerate() {
if self.depth.is_none() {
match b {
b'[' => {
tmp_pos = i + 1;
self.depth = Some(0);
},
b if b.is_ascii_whitespace() => {},
b'\0' => break,
_ => return Err(io::ErrorKind::InvalidData.into()),
}
continue;
}
if self.inside_string {
match b {
_ if self.escape_next => self.escape_next = false,
b'\\' => self.escape_next = true,
b'"' if !self.escape_next => self.inside_string = false,
_ => {},
}
continue;
}
let depth = self.depth.unwrap();
match b {
b'[' | b'{' => self.depth = Some(depth + 1),
b']' | b'}' if depth > 0 => self.depth = Some(depth - 1),
b'"' => self.inside_string = true,
b'}' if depth == 0 => return Err(io::ErrorKind::InvalidData.into()),
b',' | b']' if depth == 0 => {
let len = i - tmp_pos;
do_copy(&mut buf[buf_pos..], &tmp[tmp_pos..], len);
tmp_pos = i + 1;
buf_pos += len;
// Then write a space to separate items.
buf[buf_pos] = b' ';
buf_pos += 1;
if b == b']' {
// Reached the end of outer array. If another array
// follows, the stream will continue.
self.depth = None;
}
},
_ => {},
}
}
if tmp_pos < byte_count {
let len = byte_count - tmp_pos;
do_copy(&mut buf[buf_pos..], &tmp[tmp_pos..], len);
buf_pos += len;
}
if buf_pos > 0 {
// If at least some data was read, return with the amount. Otherwise, the outer
// loop will try again.
return Ok(buf_pos);
}
}
}
}
It is used like so:
use std::io;
use serde::Deserialize;
#[derive(Deserialize)]
struct Item {
name: String,
}
fn main() -> io::Result<()> {
let json = br#"[{"name": "foo"}, {"name": "bar"}]"#;
let wrapped = ArrayStreamReader::new_buffered(&json[..]);
let first_item: Item = serde_json::Deserializer::from_reader(wrapped)
.into_iter()
.next()
.unwrap()?;
assert_eq!(first_item.name, "foo");
Ok(())
}
At last, a question
There must be a better way to do this, right?

Related

How to convert 2 bounded loop to iteration syntax

How can I convert this loop based implementation to iteration syntax?
fn parse_number<B: AsRef<str>>(input: B) -> Option<u32> {
let mut started = false;
let mut b = String::with_capacity(50);
let radix = 16;
for c in input.as_ref().chars() {
match (started, c.is_digit(radix)) {
(false, false) => {},
(false, true) => {
started = true;
b.push(c);
},
(true, false) => {
break;
}
(true, true) => {
b.push(c);
},
}
}
if b.len() == 0 {
None
} else {
match u32::from_str_radix(b.as_str(), radix) {
Ok(v) => Some(v),
Err(_) => None,
}
}
}
The main problem I found is that the iterator has to terminate early while also ignoring characters until the first numeric char is found.
.map_while() fails because it has no state.
.reduce() and .fold() would iterate over the entire str even after the number has ended.

It looks like you want to find the first sequence of digits while ignoring any non-digits before that. You can use a combination of .skip_while and .take_while:
fn parse_number<B: AsRef<str>>(input: B) -> Option<u32> {
    let input = input.as_ref();
    let radix = 10;
    let digits: String = input.chars()
        .skip_while(|c| !c.is_digit(radix))
        .take_while(|c| c.is_digit(radix))
        .collect();
    u32::from_str_radix(&digits, radix).ok()
}
fn main() {
    dbg!(parse_number("I have 52 apples"));
}
[src/main.rs:14] parse_number("I have 52 apples") = Some(
52,
)
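If the intermediate String is a concern, the same skip-then-take idea can be sketched on byte offsets instead. This is only an illustration: `first_number` is a hypothetical name, and taking the radix as a parameter is an assumption added here so that both the radix 16 from the question and the radix 10 from the answer can be tried.

```rust
/// Hypothetical variant: finds the first run of digits in `input` for the
/// given radix and parses it, without collecting into a new String.
fn first_number(input: &str, radix: u32) -> Option<u32> {
    // Offset of the first digit; `None` if the input has no digit at all.
    let start = input.find(|c: char| c.is_digit(radix))?;
    let rest = &input[start..];
    // Offset of the first non-digit after that, or the end of the string.
    let end = rest
        .find(|c: char| !c.is_digit(radix))
        .unwrap_or(rest.len());
    // Parsing fails (and yields `None`) if the number overflows u32.
    u32::from_str_radix(&rest[..end], radix).ok()
}
```

With radix 16 the call on "I have 52 apples" stops at the "a" in "have", since that is a valid hex digit, which is one reason to be explicit about the radix.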

Get the index of second element matching condition in a rust vec

Let's say I have something like
let v: Vec<bool> = vec![false, true, false, true, false, false];
I want to get the position of the second "true" (so in this case get_second_index(v) should return Some(3)). Currently I'm doing the following, which I think is pretty ugly:
fn get_second_index(v: Vec<bool>) -> Option<u32> {
    let mut num_matching = 0;
    for (i, b) in v.iter().enumerate() {
        if *b {
            num_matching += 1;
        }
        if num_matching == 2 {
            // Stop at the second match so later elements cannot overwrite it.
            return Some(i as u32);
        }
    }
    None
}
Is there any more elegant, more idiomatically rust way to do this? Thanks!

You can simply .enumerate() an Iterator over the Vec to get the indices, then use .filter_map() on the enumerated iterator to get all true-values, and use .nth() on the filtered iterator to get the second match:
fn second(inp: &[bool]) -> Option<usize> {
    inp.iter()
        .enumerate()
        .filter_map(|(idx, b)| (*b).then(|| idx))
        .nth(1)
}
fn main() {
    let v: Vec<bool> = vec![false, true, false, true, false, false];
    assert_eq!(second(&v), Some(3));
}
Notice that this returns an Option<usize>, not an Option<u32>, because all indices are usize.
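The same chain generalizes beyond bools and beyond "second". A minimal sketch, where `nth_position` is a hypothetical helper name and `n` is zero-based like Iterator::nth:

```rust
/// Hypothetical helper: index of the n-th (zero-based) element of `items`
/// for which `pred` returns true.
fn nth_position<T>(
    items: &[T],
    n: usize,
    mut pred: impl FnMut(&T) -> bool,
) -> Option<usize> {
    items
        .iter()
        .enumerate()
        // Keep only the (index, element) pairs that satisfy the predicate.
        .filter(|(_, item)| pred(*item))
        .map(|(idx, _)| idx)
        // `nth` stops iterating as soon as the requested match is found.
        .nth(n)
}
```

Because iterators are lazy, nothing after the requested match is visited.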

Try this:
fn get_second_index(v: Vec<bool>) -> Option<usize> {
    // Becomes true once the first `true` has been seen.
    let mut matched = false;
    v.iter().position(|x| {
        if matched {
            *x
        } else {
            matched = *x;
            false
        }
    })
}

Iterate through a singly linked list of Option<Rc<RefCell<Node>>> without increasing strongcount

Don't ask why I'm learning Rust using linked lists. I want to mutably iterate down a recursive structure of Option<Rc<RefCell<Node>>> while keeping the ability to swap out nodes and unwrap them. I have a singly-linked list type with a tail pointer to the last node.
pub struct List<T> {
maybe_head: Option<Rc<RefCell<Node<T>>>>,
maybe_tail: Option<Rc<RefCell<Node<T>>>>,
length: usize,
}
struct Node<T> {
value: T,
maybe_next: Option<Rc<RefCell<Node<T>>>>,
}
Let's say we have a constructor and an append function:
impl<T> List<T> {
pub fn new() -> Self {
List {
maybe_head: None,
maybe_tail: None,
length: 0,
}
}
pub fn put_first(&mut self, t: T) -> &mut Self {
let new_node_rc = Rc::new(RefCell::new(Node {
value: t,
maybe_next: mem::replace(&mut self.maybe_head, None),
}));
match self.length == 0 {
true => {
let new_node_rc_clone = new_node_rc.clone();
self.maybe_head = Some(new_node_rc);
self.maybe_tail = Some(new_node_rc_clone);
},
false => {
self.maybe_head = Some(new_node_rc);
},
}
self.length += 1;
self
}
}
I want to remove and return the final node by moving the tail pointer to its predecessor, then returning the old tail. After iterating down the list using RefCell::borrow() and Rc::clone(), the first version of remove_last() below panics when trying to unwrap the tail's Rc. How do I iterate down this recursive structure without incrementing each node's strongcount?
PANICKING VERSION
pub fn remove_last(&mut self) -> Option<T> {
let mut opt: Option<Rc<RefCell<Node<T>>>>;
if let Some(rc) = &self.maybe_head {
opt = Some(Rc::clone(rc))
} else {
return None;
};
let mut rc: Rc<RefCell<Node<T>>>;
let mut countdown_to_penultimate: i32 = self.length as i32 - 2;
loop {
rc = match opt {
None => panic!(),
Some(ref wrapped_rc) => Rc::clone(wrapped_rc),
};
match RefCell::borrow(&rc).maybe_next {
Some(ref next_rc) => {
if countdown_to_penultimate == 0 {
self.maybe_tail = Some(Rc::clone(&rc));
}
opt = Some(Rc::clone(next_rc));
countdown_to_penultimate -= 1;
},
None => {
let grab_tail = match Rc::try_unwrap(opt.take().unwrap()) {
Ok(something) => {
return Some(something.into_inner().value);
}
Err(_) => panic!(),
};
},
}
}
}
If all I do during iteration is move the tail pointer and enclose the iteration code in a {...} block to drop cloned references, I can then safely swap out and return the old tail, but this is obviously unsatisfying.
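The reason the extra block helps is that Rc::try_unwrap only succeeds when the Rc is the sole strong reference. A small standalone sketch of that behaviour, with `take_if_unique` and `demo` as hypothetical names and an i32 standing in for the node:

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical helper: moves the value out only if `node` is the last owner.
fn take_if_unique(node: Rc<RefCell<i32>>) -> Option<i32> {
    Rc::try_unwrap(node).ok().map(RefCell::into_inner)
}

fn demo() -> (Option<i32>, Option<i32>) {
    let node = Rc::new(RefCell::new(7));
    // Stands in for a clone that is still alive from the traversal.
    let traversal_clone = Rc::clone(&node);
    // Two strong references exist, so unwrapping fails here.
    let while_shared = take_if_unique(node);
    // The failed attempt dropped its Rc, so the clone is now the only owner.
    let once_unique = take_if_unique(traversal_clone);
    (while_shared, once_unique)
}
```

Here the first attempt yields None and the second yields Some(7), which mirrors the panicking and the working version above.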
UNSATISFYING WORKING VERSION
pub fn remove_last(&mut self) -> Option<T> {
{let mut opt: Option<Rc<RefCell<Node<T>>>>;
if let Some(rc) = &self.maybe_head {
opt = Some(Rc::clone(rc))
} else {
return None;
};
let mut rc: Rc<RefCell<Node<T>>>;
let mut countdown_to_penultimate: i32 = self.length as i32 - 2;
loop {
rc = match opt {
None => panic!(),
Some(ref wrapped_rc) => Rc::clone(wrapped_rc),
};
match RefCell::borrow(&rc).maybe_next {
Some(ref next_rc) => {
if countdown_to_penultimate == 0 {
self.maybe_tail = Some(Rc::clone(&rc));
}
opt = Some(Rc::clone(next_rc));
countdown_to_penultimate -= 1;
},
None => {
break;
},
}
}}
match self.maybe_tail {
None => panic!(),
Some(ref rc) => {
let tail = mem::replace(&mut RefCell::borrow_mut(rc).maybe_next, None);
return Some(Rc::try_unwrap(tail.unwrap()).ok().unwrap().into_inner().value);
}
};
}

I wrote a List::remove_last() that I can live with, although I'd still like to know what more idiomatic Rust code here might look like. I find that this traversal idiom also extends naturally into things like removing the n-th node or removing the first node that matches some predicate.
fn remove_last(&mut self) -> Option<T> {
let mut opt: Option<Rc<RefCell<Node<T>>>>;
let mut rc: Rc<RefCell<Node<T>>>;
#[allow(unused_must_use)]
match self.length {
0 => {
return None;
}
1 => {
let head = mem::replace(&mut self.maybe_head, None);
mem::replace(&mut self.maybe_tail, None);
self.length -= 1;
return Some(
Rc::try_unwrap(head.unwrap())
.ok()
.unwrap()
.into_inner()
.value,
);
}
_ => {
opt = Some(Rc::clone(self.maybe_head.as_ref().unwrap()));
}
}
loop {
rc = match opt {
None => unreachable!(),
Some(ref wrapped_rc) => Rc::clone(wrapped_rc),
};
let mut borrowed_node = RefCell::borrow_mut(&rc);
let maybe_next = &mut borrowed_node.maybe_next;
match maybe_next {
None => unreachable!(),
Some(_)
if std::ptr::eq(
maybe_next.as_ref().unwrap().as_ptr(),
self.maybe_tail.as_ref().unwrap().as_ptr(),
) =>
{
borrowed_node.maybe_next = None;
let old_tail = self.maybe_tail.replace(Rc::clone(&rc));
self.length -= 1;
return Some(
Rc::try_unwrap(old_tail.unwrap())
.ok()
.unwrap()
.into_inner()
.value,
);
}
Some(ref next_rc) => {
opt = Some(Rc::clone(next_rc));
}
}
}
}
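For comparison, here is a shorter sketch of the same traversal. It is not claimed to be the canonical way: the `Link` alias is an assumption added for readability, and it still clones an Rc per step, but every temporary clone is dropped before Rc::try_unwrap runs. Rc::ptr_eq replaces the manual pointer comparison.

```rust
use std::cell::RefCell;
use std::rc::Rc;

// Alias introduced for this sketch only.
type Link<T> = Option<Rc<RefCell<Node<T>>>>;

struct Node<T> {
    value: T,
    maybe_next: Link<T>,
}

pub struct List<T> {
    maybe_head: Link<T>,
    maybe_tail: Link<T>,
    length: usize,
}

impl<T> List<T> {
    pub fn new() -> Self {
        List { maybe_head: None, maybe_tail: None, length: 0 }
    }

    pub fn put_first(&mut self, value: T) {
        let node = Rc::new(RefCell::new(Node {
            value,
            maybe_next: self.maybe_head.take(),
        }));
        if self.maybe_tail.is_none() {
            self.maybe_tail = Some(Rc::clone(&node));
        }
        self.maybe_head = Some(node);
        self.length += 1;
    }

    pub fn remove_last(&mut self) -> Option<T> {
        // Taking the tail leaves `None` behind; an empty list returns here.
        let old_tail = self.maybe_tail.take()?;
        if self.length == 1 {
            self.maybe_head = None;
        } else {
            // Walk to the node whose `maybe_next` is the old tail.
            let mut cursor = Rc::clone(self.maybe_head.as_ref()?);
            loop {
                // `next` is a temporary clone, dropped at the end of each pass.
                let next = cursor.borrow().maybe_next.clone()?;
                if Rc::ptr_eq(&next, &old_tail) {
                    break;
                }
                cursor = next;
            }
            // Unlink the old tail and make its predecessor the new tail.
            cursor.borrow_mut().maybe_next = None;
            self.maybe_tail = Some(cursor);
        }
        self.length -= 1;
        // `old_tail` is now the only strong reference, so this succeeds.
        Rc::try_unwrap(old_tail).ok().map(|cell| cell.into_inner().value)
    }
}
```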

Why does matching on the result of Regex::find complain about expecting a struct regex::Match but found tuple?

I copied this code from Code Review into IntelliJ IDEA to try and play around with it. I have a homework assignment that is similar to this one (I need to write a version of Linux's bc in Rust), so I am using this code only for reference purposes.
use std::io;
extern crate regex;
#[macro_use]
extern crate lazy_static;
use regex::Regex;
fn main() {
let tokenizer = Tokenizer::new();
loop {
println!("Enter input:");
let mut input = String::new();
io::stdin()
.read_line(&mut input)
.expect("Failed to read line");
let tokens = tokenizer.tokenize(&input);
let stack = shunt(tokens);
let res = calculate(stack);
println!("{}", res);
}
}
#[derive(Debug, PartialEq)]
enum Token {
Number(i64),
Plus,
Sub,
Mul,
Div,
LeftParen,
RightParen,
}
impl Token {
/// Returns the precedence of op
fn precedence(&self) -> usize {
match *self {
Token::Plus | Token::Sub => 1,
Token::Mul | Token::Div => 2,
_ => 0,
}
}
}
struct Tokenizer {
number: Regex,
}
impl Tokenizer {
fn new() -> Tokenizer {
Tokenizer {
number: Regex::new(r"^[0-9]+").expect("Unable to create the regex"),
}
}
/// Tokenizes the input string into a Vec of Tokens.
fn tokenize(&self, mut input: &str) -> Vec<Token> {
let mut res = vec![];
loop {
input = input.trim_left();
if input.is_empty() { break }
let (token, rest) = match self.number.find(input) {
Some((_, end)) => {
let (num, rest) = input.split_at(end);
(Token::Number(num.parse().unwrap()), rest)
},
_ => {
match input.chars().next() {
Some(chr) => {
(match chr {
'+' => Token::Plus,
'-' => Token::Sub,
'*' => Token::Mul,
'/' => Token::Div,
'(' => Token::LeftParen,
')' => Token::RightParen,
_ => panic!("Unknown character!"),
}, &input[chr.len_utf8()..])
}
None => panic!("Ran out of input"),
}
}
};
res.push(token);
input = rest;
}
res
}
}
/// Transforms the tokens created by `tokenize` into RPN using the
/// [Shunting-yard algorithm](https://en.wikipedia.org/wiki/Shunting-yard_algorithm)
fn shunt(tokens: Vec<Token>) -> Vec<Token> {
let mut queue = vec![];
let mut stack: Vec<Token> = vec![];
for token in tokens {
match token {
Token::Number(_) => queue.push(token),
Token::Plus | Token::Sub | Token::Mul | Token::Div => {
while let Some(o) = stack.pop() {
if token.precedence() <= o.precedence() {
queue.push(o);
} else {
stack.push(o);
break;
}
}
stack.push(token)
},
Token::LeftParen => stack.push(token),
Token::RightParen => {
let mut found_paren = false;
while let Some(op) = stack.pop() {
match op {
Token::LeftParen => {
found_paren = true;
break;
},
_ => queue.push(op),
}
}
assert!(found_paren)
},
}
}
while let Some(op) = stack.pop() {
queue.push(op);
}
queue
}
/// Takes a Vec of Tokens converted to RPN by `shunt` and calculates the result
fn calculate(tokens: Vec<Token>) -> i64 {
let mut stack = vec![];
for token in tokens {
match token {
Token::Number(n) => stack.push(n),
Token::Plus => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a + b);
},
Token::Sub => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a - b);
},
Token::Mul => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a * b);
},
Token::Div => {
let (b, a) = (stack.pop().unwrap(), stack.pop().unwrap());
stack.push(a / b);
},
_ => {
// By the time the token stream gets here, all the LeftParen
// and RightParen tokens will have been removed by shunt()
unreachable!();
},
}
}
stack[0]
}
When I run it, however, it gives me this error:
error[E0308]: mismatched types
--> src\main.rs:66:22
|
66 | Some((_, end)) => {
| ^^^^^^^^ expected struct `regex::Match`, found tuple
|
= note: expected type `regex::Match<'_>`
found type `(_, _)`
It's complaining that I am using a tuple for the Some() method when I am supposed to use a token. I am not sure what to pass for the token, because it appears that the tuple is traversing through the Token options. How do I re-write this to make the Some() method recognize the tuple as a Token? I have been working on this for a day but I have not found any really good solutions.

The code you are referencing is over two years old. Notably, that predates regex 1.0. Version 0.1.80 defines Regex::find as:
fn find(&self, text: &str) -> Option<(usize, usize)>
while version 1.0.6 defines it as:
pub fn find<'t>(&self, text: &'t str) -> Option<Match<'t>>
However, Match provides methods for the start and end indices that the older code relied on. In this case, since you only care about the end index, you can call Match::end:
let (token, rest) = match self.number.find(input).map(|x| x.end()) {
Some(end) => {
// ...
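As an aside, since the pattern ^[0-9]+ only looks for a leading run of ASCII digits, that one step can also be sketched without the regex crate. This swaps in plain str methods rather than showing the regex API, and `split_leading_number` is a hypothetical name:

```rust
/// Hypothetical std-only stand-in for the `^[0-9]+` match: splits a leading
/// run of ASCII digits off `input` and parses it.
fn split_leading_number(input: &str) -> Option<(i64, &str)> {
    // Byte offset of the first non-digit, or the whole string if all digits.
    let end = input
        .find(|c: char| !c.is_ascii_digit())
        .unwrap_or(input.len());
    if end == 0 {
        // No leading digit, so the caller should try the operator branch.
        return None;
    }
    let (num, rest) = input.split_at(end);
    // `parse` can still fail on overflow; that also yields `None`.
    Some((num.parse().ok()?, rest))
}
```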

What is an efficient way of parsing a string of single-character commands, each optionally followed by an integer repeat count?

I am implementing a robot that takes orders like L (turn left), R (turn right) and M (move forward). These orders may be augmented with a quantifier like M3LMR2 (move 3 steps, turn left, move one step, turn right twice). This is the equivalent of MMMLMRR.
I coded the robot structure that can understand the following enum:
pub enum Message {
TurnLeft(i8),
TurnRight(i8),
MoveForward(i8),
}
Robot::execute(&mut self, orders: Vec<Message>) is doing its job correctly.
Now, I am struggling to write something decent for the string parsing, juggling with &str, String, char and unsafe slicings because tokens can be 1 or more characters.
I have tried regular expression matching (almost worked), but I really want to tokenize the string:
fn capture(orders: &String, start: &usize, end: &usize) -> Message {
unsafe {
let order = orders.get_unchecked(start..end);
// …
};
Message::TurnLeft(1) // temporary
}
pub fn parse_orders(orders: String) -> Result<Vec<Message>, String> {
let mut messages = vec![];
let mut start: usize = 0;
let mut end: usize = 0;
while end < orders.len() && end != start {
end += 1;
match orders.get(end) {
Some('0'...'9') => continue,
_ => {
messages.push(capture(&orders, &start, &end));
start = end;
}
}
}
Ok(messages)
}
This doesn't compile and is clumsy.
The idea is to write a parser that turns the order string into a vector of Message:
let messages = parse_orders("M3LMR2");
println!("Messages => {:?}", messages);
// would print
// [Message::MoveForward(3), Message::TurnLeft(1), Message::MoveForward(1), Message::TurnRight(2)]
What would be the efficient/elegant way for doing that?

You can do this very simply with an iterator, using parse and some basic String processing:
#[derive(Debug, PartialEq, Clone)]
enum Message {
TurnLeft(u8),
TurnRight(u8),
MoveForward(u8),
}
struct RobotOrders(String);
impl RobotOrders {
fn new(source: impl Into<String>) -> Self {
RobotOrders(source.into())
}
}
impl Iterator for RobotOrders {
type Item = Message;
fn next(&mut self) -> Option<Message> {
self.0.chars().next()?;
let order = self.0.remove(0);
let n_digits = self.0.chars().take_while(char::is_ascii_digit).count();
let mut number = self.0.clone();
self.0 = number.split_off(n_digits);
let number = number.parse().unwrap_or(1);
Some(match order {
'L' => Message::TurnLeft(number),
'R' => Message::TurnRight(number),
'M' => Message::MoveForward(number),
_ => unimplemented!(),
})
}
}
fn main() {
use Message::*;
let orders = RobotOrders::new("M3LMR2");
let should_be = [MoveForward(3), TurnLeft(1), MoveForward(1), TurnRight(2)];
assert!(orders.eq(should_be.iter().cloned()));
}
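Since the question asked for a Result-returning parse_orders, here is one way that could look, as a sketch rather than a drop-in: it borrows the input and walks it once with a peekable char iterator instead of removing and cloning from the String on every step. The error messages and the choice to reject counts that do not fit in u8 are assumptions made for this example.

```rust
#[derive(Debug, PartialEq, Clone)]
enum Message {
    TurnLeft(u8),
    TurnRight(u8),
    MoveForward(u8),
}

fn parse_orders(orders: &str) -> Result<Vec<Message>, String> {
    let mut messages = Vec::new();
    let mut chars = orders.chars().peekable();
    while let Some(order) = chars.next() {
        // Accumulate the optional repeat count one digit at a time.
        let mut count: Option<u8> = None;
        while let Some(digit) = chars.peek().and_then(|c| c.to_digit(10)) {
            let grown = count
                .unwrap_or(0)
                .checked_mul(10)
                .and_then(|n| n.checked_add(digit as u8))
                .ok_or_else(|| format!("repeat count after '{}' does not fit in u8", order))?;
            count = Some(grown);
            chars.next(); // consume the digit that was only peeked at
        }
        // An order without a count is executed once.
        let n = count.unwrap_or(1);
        messages.push(match order {
            'L' => Message::TurnLeft(n),
            'R' => Message::TurnRight(n),
            'M' => Message::MoveForward(n),
            other => return Err(format!("unknown order '{}'", other)),
        });
    }
    Ok(messages)
}
```

Unknown letters and oversized counts come back as Err instead of panicking, so the caller decides how to report them.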
