Writing Rust-y code: Keeping references to different structs depending on type of object in XML - rust

I'm having a hard time formulating this in a rust-y manner, since my brain is still hardwired in Python. So I have a XML file:
<xml>
<car>
<name>First car</name>
<brand>Volvo</brand>
</car>
<plane>
<name>First plane</name>
<brand>Boeing</brand>
</plane>
<car>
<name>Second car</name>
<brand>Volvo</brand>
</car>
</xml>
In reality it's much more complex and the XML is about 500-1000MB large. I'm reading this using quick-xml which gives me events such as Start (tag start), Text and End (tag end) and I'm doing a state machine to keep track.
Now I want to off-load the parsing of car and plane to different modules (they need to be handled differently) but share a base-implementation/trait.
So far so good.
Now using my state machine I know when I need to offload to the car or the plane:
When I enter the main car tag I want to create a new instance of car
After that, offload everything until the corresponding </car> to it
When we reach the end I'm going to call .save() on the car implementation to store it elsewhere, and can free/destroy the instance.
But this means in my main loop I need to create a new instance of the car and keep track of it (and the same for plane if that's the main element.
let mut current_xml_section: I_DONT_KNOW_THE_TYPE = Some()
loop {
match reader.read_event(&mut buf) {
Ok(Event::Start(ref e)) => {
if state == State::Unknown {
match e.name() {
b"car" => {
state = State::InSection;
current_section = CurrentSection::Car;
state_at_depth = depth;
current_xml_section = CurrentSection::Car::new(e); // this won't work
},
b"plane" => {
state = State::InSection;
current_section = CurrentSection::Plane;
state_at_depth = depth;
current_xml_section = CurrentSection::Plane::new(e); // this won't work
},
_ => (),
};
}else{
current_xml_section.start_tag(e); // this won't work
}
depth += 1;
},
Ok(Event::End(ref e)) => {
depth -= 1;
if state == State::InSection && state_at_depth == depth {
state = State::Unknown;
current_section = CurrentSection::Unknown;
state_at_depth = 0;
current_xml_section.save(); // this won't work
// Free current_xml_section here
}else{
if state == State::InSection {
current_xml_section.end_tag(e) // this won't work
}
}
},
// unescape and decode the text event using the reader encoding
Ok(Event::Text(e)) => (
if state == State::InSection {
current_xml_section.text_data(e) // this won't work
}
),
Ok(Event::Eof) => break, // exits the loop when reaching end of file
Err(e) => panic!("Error at position {}: {:?}", reader.buffer_position(), e),
_ => (), // There are several other `Event`s we do not consider here
}
// if we don't keep a borrow elsewhere, we can clear the buffer to keep memory usage low
buf.clear();
}
}
So I basically don't know how to keep a reference in the main loop to the "current" object (I'm sorry, Python term), given that:
We may or may not have a current tag we're processing
That section might be a reference to either Car or Plane
I've also considered:
Use Serde, but it's a massive document and frankly I don't know the entire structure of it (I'm black box decoding it) so it would need to be passed to Serde in chunks (and I didn't manage to do that, even though I tried)
Keeping a reference to the latest plane, the latest car (and start by creating blank objects outside of the main loop) but it feels ugly
Using Generics
Any nudge in the right direction would be welcome as I try to un-Python my brain!

Event driven parsing of XML lends itself particularly well to a scope driven approach, where each level is parsed by a different function.
For example, your main loop could look like this:
loop {
match reader.read_event(&mut buf) {
Ok(Event::Start(ref e)) => {
match e.name() {
b"car" => handle_car(&mut reader, &mut buf)?,
b"plane" => handle_plane(&mut reader, &mut buf)?,
_ => return Err("Unexpected Tag"),
}
},
Ok(Event::Eof) => break,
_ => (),
}
}
Note that the inner match statement only has to consider the XML tags that can occur at the top level; any other tag is unexpected and should generate an error.
handle_car would look something like this:
fn handle_car(reader: &mut Reader<&[u8]>, buf:&mut Vec<u8>) -> Result<(),ErrType> {
let mut car = Car::new();
loop {
match reader.read_event(buf) {
Ok(Event::Start(ref e)) => {
match e.name() {
b"name" => {
car.name = handle_name(reader, buf)?;
},
b"brand" => {
car.brand = handle_brand(reader, buf)?;
},
_ => return Err("bad tag"),
}
},
Ok(Event::End(ref e)) => break,
Ok(Event::Eof) => return Err("Unexpected EOF"),
_ => (),
}
}
car.save();
Ok(())
}
handle_car creates its own instance of Car, which lives within the scope of that function. It has its own loop where it handles all the tags that can occur within it. If those tags contain yet more tags, you just introduce a new set of handling functions for them. The function returns a Result so that if the input structure does not match expectations the error can be passed up (as can any errors produced by quick_xml, which I have ignored but real code would handle).
This pattern has some advantages when parsing XML:
The structure of the code matches the expected structure of the XML, making it easier to read and understand.
The state is implicit in the structure of the code. No need for state variables or depth counters.
Common tags, that appear in multiple places (such as <name> and <brand> can be handled by common functions that are re-used
If the XML format you are parsing has nested structures (eg. if <car> could contain another <car>) this is handled by recursion.
Your original problem of not knowing how to store the Car / Plane within the main loop is completely avoided.

Related

Reduce nested match statements with unwrap_or_else in Rust

I have a rust program that has multiple nested match statements as shown below.
match client.get(url).send() {
Ok(mut res) => {
match res.read_to_string(&mut s) {
Ok(m) => {
match get_auth(m) {
Ok(k) => k,
Err(_) => return Err(“a”);
}
},
Err(_) => {
return Err(“b”);
}
}
},
Err(_) => {
return Err(“c”);
},
};
All the variables k and m are of type String.I am looking for a way to make the code more readable by removing excessive nested match statements keeping the error handling intact since both the output and the error types are important for the problem.Is it possible to achieve this by unwrap_or_else?
The .map_err() utility converts a Result to have a new error type, leaving the success type alone. It accepts a closure that consumes the existing error value and returns the new one.
The ? operator will early-return the error in the Err case, and unwrap in the Ok case.
Combining these two allows you to express this same flow succinctly:
get_auth(
client.get(url).send().map_err(|_| "c")?
.read_to_string(&mut s).map_err(|_| "b")?
).map_err(|_| "a")?
(I suspect that you actually want to pass s to get_auth() but that's not what the code in your question does, so I'm choosing to represent the code you posted instead of imaginary code that I'm guessing about.)

Specifying shape of dataset when creating HDF5 file

I have a function with the following signature:
fn create_something( ... ) -> ndarray::ArrayBase<ndarray::OwnedRepr<f64>, ndarray::Dim<[usize; 2]>> {
// Does stuff
}
I then call this function and I would like to save the data into a HDF5 file:
let data = create_something( ... );
let file = hdf5::File::create("./output/example.h5"); // open for writing
let group = file.create_group("example_group"); // create a group
let data_set = group.new_dataset::<f64>().create("example");
let write_res = data_set.write(data.view());
match write_res {
Ok(_) => (),
Err(error) => panic!("Error: {:?}", error)
}
Resulting in
thread 'main' panicked at 'Error: shape mismatch when writing: memory = [1950, 24191], destination = []', src/main.rs:203:27
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
So there seems to be a problem with the shape, which is reasonable since I haven't really specified the shape of the dataset... What I don't understand, is how I can specify this (I find the documentation on this very confusing...) and how exactly one could append data to the data-set if we need to know its exact shape at creation...
The basic issue is that you need to tell the HDF5 library the shape of the in-file dataset.
HDF5 is extremely flexible, the cost of which is complexity. It's possible to convert data between a file and the in-memory representation in a whole host of ways, including reshaping, transposing, swapping byte order, converting types, and much more.
For example, you can read every other sample from a file, into a contiguous array in memory. The details of these two different representations is captured by a dataspace. These can be different for an in-file and in-memory dataset.
In your example, you've declared a dataset with a particular name and datatype (f64), but given the HDF5 library no other information about it. The library does not assume that, because you've provided a contiguous array of some size, that the datapsace is that size, again, because you can read/write between a dataspaces that are of different sizes.
In this case, all you need to add is the shape of the dataset.
fn main() {
let data = vec![0.0, 1.0, 2.0];
let file = hdf5::File::create("./example.h5").unwrap(); // open for writing
let group = file.create_group("example_group").unwrap(); // create a group
let data_set = group
.new_dataset::<f64>()
.shape([data.len()])
.create("example")
.unwrap();
let write_res = data_set.write(&data);
match write_res {
Ok(_) => (),
Err(error) => panic!("Error: {:?}", error),
}
}
You can see the file is appropriately filled:
$ h5dump example.h5
HDF5 "example.h5" {
GROUP "/" {
GROUP "example_group" {
DATASET "example" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): 0, 1, 2
}
}
}
}
}

Idiomatic rust way to properly parse Clap ArgMatches

I'm learning rust and trying to make a find like utility (yes another one), im using clap and trying to support command line and config file for the program's parameters(this has nothing to do with the clap yml file).
Im trying to parse the commands and if no commands were passed to the app, i will try to load them from a config file.
Now I don't know how to do this in an idiomatic way.
fn main() {
let matches = App::new("findx")
.version(crate_version!())
.author(crate_authors!())
.about("find + directory operations utility")
.arg(
Arg::with_name("paths")
...
)
.arg(
Arg::with_name("patterns")
...
)
.arg(
Arg::with_name("operation")
...
)
.get_matches();
let paths;
let patterns;
let operation;
if matches.is_present("patterns") && matches.is_present("operation") {
patterns = matches.values_of("patterns").unwrap().collect();
paths = matches.values_of("paths").unwrap_or(clap::Values<&str>{"./"}).collect(); // this doesn't work
operation = match matches.value_of("operation").unwrap() { // I dont like this
"Append" => Operation::Append,
"Prepend" => Operation::Prepend,
"Rename" => Operation::Rename,
_ => {
print!("Operation unsupported");
process::exit(1);
}
};
}else if Path::new("findx.yml").is_file(){
//TODO: try load from config file
}else{
eprintln!("Command line parameters or findx.yml file must be provided");
process::exit(1);
}
if let Err(e) = findx::run(Config {
paths: paths,
patterns: patterns,
operation: operation,
}) {
eprintln!("Application error: {}", e);
process::exit(1);
}
}
There is an idiomatic way to extract Option and Result types values to the same scope, i mean all examples that i have read, uses match or if let Some(x) to consume the x value inside the scope of the pattern matching, but I need to assign the value to a variable.
Can someone help me with this, or point me to the right direction?
Best Regards
Personally I see nothing wrong with using the match statements and folding it or placing it in another function. But if you want to remove it there are many options.
There is the ability to use the .default_value_if() method which is impl for clap::Arg and have a different default value depending on which match arm is matched.
From the clap documentation
//sets value of arg "other" to "default" if value of "--opt" is "special"
let m = App::new("prog")
.arg(Arg::with_name("opt")
.takes_value(true)
.long("opt"))
.arg(Arg::with_name("other")
.long("other")
.default_value_if("opt", Some("special"), "default"))
.get_matches_from(vec![
"prog", "--opt", "special"
]);
assert_eq!(m.value_of("other"), Some("default"));
In addition you can add a validator to your operation OR convert your valid operation values into flags.
Here's an example converting your match arm values into individual flags (smaller example for clarity).
extern crate clap;
use clap::{Arg,App};
fn command_line_interface<'a>() -> clap::ArgMatches<'a> {
//Sets the command line interface of the program.
App::new("something")
.version("0.1")
.arg(Arg::with_name("rename")
.help("renames something")
.short("r")
.long("rename"))
.arg(Arg::with_name("prepend")
.help("prepends something")
.short("p")
.long("prepend"))
.arg(Arg::with_name("append")
.help("appends something")
.short("a")
.long("append"))
.get_matches()
}
#[derive(Debug)]
enum Operation {
Rename,
Append,
Prepend,
}
fn main() {
let matches = command_line_interface();
let operation = if matches.is_present("rename") {
Operation::Rename
} else if matches.is_present("prepend"){
Operation::Prepend
} else {
//DEFAULT
Operation::Append
};
println!("Value of operation is {:?}",operation);
}
I hope this helps!
EDIT:
You can also use Subcommands with your specific operations. It all depends on what you want to interface to be like.
let app_m = App::new("git")
.subcommand(SubCommand::with_name("clone"))
.subcommand(SubCommand::with_name("push"))
.subcommand(SubCommand::with_name("commit"))
.get_matches();
match app_m.subcommand() {
("clone", Some(sub_m)) => {}, // clone was used
("push", Some(sub_m)) => {}, // push was used
("commit", Some(sub_m)) => {}, // commit was used
_ => {}, // Either no subcommand or one not tested for...
}

Why is a match pattern with a guard clause not exhaustive?

Consider the following code sample (playground).
#[derive(PartialEq, Clone, Debug)]
enum State {
Initial,
One,
Two,
}
enum Event {
ButtonOne,
ButtonTwo,
}
struct StateMachine {
state: State,
}
impl StateMachine {
fn new() -> StateMachine {
StateMachine {
state: State::Initial,
}
}
fn advance_for_event(&mut self, event: Event) {
// grab a local copy of the current state
let start_state = self.state.clone();
// determine the next state required
let end_state = match (start_state, event) {
// starting with initial
(State::Initial, Event::ButtonOne) => State::One,
(State::Initial, Event::ButtonTwo) => State::Two,
// starting with one
(State::One, Event::ButtonOne) => State::Initial,
(State::One, Event::ButtonTwo) => State::Two,
// starting with two
(State::Two, Event::ButtonOne) => State::One,
(State::Two, Event::ButtonTwo) => State::Initial,
};
self.transition(end_state);
}
fn transition(&mut self, end_state: State) {
// update the state machine
let start_state = self.state.clone();
self.state = end_state.clone();
// handle actions on entry (or exit) of states
match (start_state, end_state) {
// transitions out of initial state
(State::Initial, State::One) => {}
(State::Initial, State::Two) => {}
// transitions out of one state
(State::One, State::Initial) => {}
(State::One, State::Two) => {}
// transitions out of two state
(State::Two, State::Initial) => {}
(State::Two, State::One) => {}
// identity states (no transition)
(ref x, ref y) if x == y => {}
// ^^^ above branch doesn't match, so this is required
// _ => {},
}
}
}
fn main() {
let mut sm = StateMachine::new();
sm.advance_for_event(Event::ButtonOne);
assert_eq!(sm.state, State::One);
sm.advance_for_event(Event::ButtonOne);
assert_eq!(sm.state, State::Initial);
sm.advance_for_event(Event::ButtonTwo);
assert_eq!(sm.state, State::Two);
sm.advance_for_event(Event::ButtonTwo);
assert_eq!(sm.state, State::Initial);
}
In the StateMachine::transition method, the code as presented does not compile:
error[E0004]: non-exhaustive patterns: `(Initial, Initial)` not covered
--> src/main.rs:52:15
|
52 | match (start_state, end_state) {
| ^^^^^^^^^^^^^^^^^^^^^^^^ pattern `(Initial, Initial)` not covered
But that is exactly the pattern I'm trying to match! Along with the (One, One) edge and (Two, Two) edge. Importantly I specifically want this case because I want to leverage the compiler to ensure that every possible state transition is handled (esp. when new states are added later) and I know that identity transitions will always be a no-op.
I can resolve the compiler error by uncommenting the line below that one (_ => {}) but then I lose the advantage of having the compiler check for valid transitions because this will match on any states added in the future.
I could also resolve this by manually typing each identity transition such as:
(State::Initial, State::Initial) => {}
That is tedious, and at that point I'm just fighting the compiler. This could probably be turned into a macro, so I could possibly do something like:
identity_is_no_op!(State);
Or worst case:
identity_is_no_op!(State::Initial, State::One, State::Two);
The macro can automatically write this boilerplate any time a new state is added, but that feels like unnecessary work when the pattern I have written should be covering the exact case that I'm looking for.
Why doesn't this work as written?
What is the cleanest way to do what I am trying to do?
I have decided that a macro of the second form (i.e. identity_is_no_op!(State::Initial, State::One, State::Two);) is actually the preferred solution.
It's easy to imagine a future in which I do want some of the states to do something in the 'no transition' case. Using this macro would still have the desired effect of forcing the state machine to be revisited when new States are added and would just require adding the new state to the macro arglist if nothing needs to be done. A reasonable compromise IMO.
I think the question is still useful because the behavior was surprising to me as a relatively new Rustacean.
Why doesn't this work as written?
Because the Rust compiler cannot take guard expressions into account when determining if a match is exhaustive. As soon as you have a guard, it assumes that guard could fail.
Note that this has nothing to do with the distinction between refutable and irrefutable patterns. if guards are not part of the pattern, they're part of the match syntax.
What is the cleanest way to do what I am trying to do?
List every possible combination. A macro could slightly shorten writing the patterns, but you can't use macros to replace entire match arms.

What is the proper way to consume pieces of a slice in a loop when the piece size can change each iteration?

I need to process a slice of bytes into fixed chunks, but the chunk pattern is only known at runtime:
pub fn process(mut message: &[u8], pattern: &[Pattern]) {
for element in pattern: {
match element {
(...) => {
let (chunk, message2) = message.split_at(element.size);
/* process chunk */
message = message2;
},
// ...
}
}
}
It feels awkward to have to use this message2. But if I do
let (chunk, message) = message.split_at(element.size);
Then it does not work, I assume because this actually creates a new messages binding that goes out of scope between loop iterations.
Is there a more elegant way to do this?
You are correct in your reasoning that let (chunk, message) = message.split_at(element.size); creates a new binding message within that scope and does not update the outer message value.
What you are looking for is a 'destructuring assignment' of the tuple. This would allow tuple elements to be assigned to existing variable bindings instead of creating new bindings, something like:
let chunk;
(chunk, message) = message.split_at(element.size);
Unfortunately this is currently not possible in Rust. You can see a pre-RFC which proposes to add destructuring assignment to the Rust language.
I believe what you currently have is a perfectly fine solution, perhaps rename message2 to something like rest_of_message or remaining_message.

Resources