Deserializing multiple documents with `serde_yaml`

I am appending a stream of events to a YAML log file, where each event is represented by an individual document, like this:
---
type: event
id: 1
---
type: trigger
id: 2
At some point later I want to iterate on these events, parsing each via serde_yaml. To my understanding though, serde_yaml doesn't seem to support parsing multiple documents from a single reader, as none of the available methods mention it, and trying to parse multiple documents at once results in a MoreThanOneDocument error.
use std::io;
use serde::Deserialize;
#[derive(Deserialize, Debug)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum Message {
    Event { id: i32 },
    Trigger { id: i32 },
}
fn main() -> io::Result<()> {
    let yaml = "---\ntype: event\nid: 1\n---\n\ntype: trigger\nid: 2";
    // Fails at runtime: from_reader expects exactly one document.
    let v: Message = serde_yaml::from_reader(yaml.as_bytes()).unwrap();
    println!("{:?}", v);
    Ok(())
}
I'm totally new to Rust, so maybe I completely missed the point of serde and just did not understand how to do it.
How would you parse such YAML, please?
I cooked up something that looks like a working solution, but I think I'll try to post it among the answers instead, because I don't want to bias other answers too much towards my solution. I kindly encourage you to have a look at it as well however, any feedback is welcome.

The documentation of serde_yaml::Deserializer shows an example very similar to yours. It would work like this:
use serde::Deserialize;
#[derive(Deserialize, Debug)]
#[serde(tag = "type", rename_all = "snake_case")]
pub enum Message {
    Event { id: i32 },
    Trigger { id: i32 },
}
fn main() {
    let yaml = "---\ntype: event\nid: 1\n---\ntype: trigger\nid: 2\n";
    // The Deserializer iterates over the documents in the input, one at a time.
    for document in serde_yaml::Deserializer::from_str(yaml) {
        let v = Message::deserialize(document).unwrap();
        println!("{:?}", v);
    }
}

I really hope to find a native solution by using serde and serde_yaml only, but until then the way I got it working is as follows.
use std::io::{self, BufRead};
trait BufReaderYamlExt {
    fn read_next_yaml(&mut self) -> io::Result<Option<String>>;
}
impl<T: io::Read> BufReaderYamlExt for io::BufReader<T> {
    fn read_next_yaml(&mut self) -> io::Result<Option<String>> {
        // A document ends where a line containing only `---` follows a newline.
        const SEP: &str = "\n---\n";
        let mut doc = String::with_capacity(200);
        // Append line by line until the separator or the end of the stream.
        while self.read_line(&mut doc)? > 0 {
            if doc.len() > SEP.len() && doc.ends_with(SEP) {
                doc.truncate(doc.len() - SEP.len());
                break;
            }
        }
        if !doc.is_empty() {
            doc.shrink_to_fit();
            Ok(Some(doc))
        } else {
            Ok(None)
        }
    }
}
The trait extends BufReader with an extra method that returns an optional owned String containing just the portion of the stream that holds a single YAML document (or None at the end of the stream).
By iterating on it, one can then apply serde_yaml::from_str() to parse each document into a Message value.
fn main() -> io::Result<()> {
    let yaml = "---\ntype: event\nid: 1\n\n---\n\ntype: trigger\nid: 2\n";
    let mut r = io::BufReader::new(yaml.as_bytes());
    while let Some(next) = r.read_next_yaml()? {
        let d: Message = serde_yaml::from_str(&next).unwrap();
        println!("parsed: {:?}", d);
    }
    Ok(())
}
I've made available the full source on the rust playground as well.
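The same splitting idea can also be sketched on an in-memory string with the standard library only. This is a simplification, not a YAML parser: it assumes that any line consisting solely of --- is a document separator, and the helper name split_yaml_documents is hypothetical. Each returned string could then be passed to serde_yaml::from_str() as in the loop above.

```rust
/// Splits a multi-document YAML string into individual documents.
/// Assumption: a line consisting solely of `---` separates documents.
fn split_yaml_documents(input: &str) -> Vec<String> {
    let mut docs = Vec::new();
    let mut current = String::new();
    for line in input.lines() {
        if line.trim_end() == "---" {
            // A separator closes the document collected so far, if any.
            if !current.trim().is_empty() {
                docs.push(current.trim_matches('\n').to_string());
            }
            current.clear();
        } else {
            current.push_str(line);
            current.push('\n');
        }
    }
    // The last document has no trailing separator.
    if !current.trim().is_empty() {
        docs.push(current.trim_matches('\n').to_string());
    }
    docs
}

fn main() {
    let yaml = "---\ntype: event\nid: 1\n\n---\n\ntype: trigger\nid: 2\n";
    for doc in split_yaml_documents(yaml) {
        println!("document:\n{}\n", doc);
    }
}
```

Unlike the BufReader extension, this version needs the whole log in memory, so it is better suited to small files.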

Related

How can I easily get a reference to a value after it has been moved into a tuple-type enum variant?

I want to move a value into a tuple-type enum variant and obtain a reference to the value after it has been moved. I see how this is possible with an if let statement, but that seems unnecessary when the particular variant is known statically.
Is there any way to get the reference to the moved value without requiring an if let or match?
This code block is a simple illustration of my question (see below for a more challenging case):
enum Transport {
    Car(u32), // horsepower
    Horse(String), // name
}
fn do_something(x: &String) {
    // println! needs a format string as its first argument.
    println!("{}", x);
}
fn main() {
    // Can I avoid needing this if, which is clearly redundant?
    if let Transport::Horse(ref name) = Transport::Horse("daisy".into()) {
        do_something(name);
    }
    else {
        // Can never happen
    }
    // I tried the following, it gives:
    // "error[E0005]: refutable pattern in local binding: `Car(_)` not covered"
    let Transport::Horse(ref name) = Transport::Horse("daisy".into());
}
It is easy to find ways to side-step the issue in the above code, since there are no real interface requirements. Consider instead the following example, where I am building a simple API for building trees (where each node can have n children). Nodes have an add_child_node method returning a reference to the node that was added, to allow chaining of calls to quickly build deep trees. (It is debatable whether this is a good API, but that is irrelevant to the question). add_child_node must return a mutable reference to the contents of an enum variant. Is the if let required in this example (without changing the API)?
struct Node {
    children: Vec<Child>,
    // ...
}
enum Child {
    Node(Node),
    Leaf
}
impl Node {
    fn add_child_node(&mut self, node: Node) -> &mut Node {
        self.children.push(Child::Node(node));
        // It seems like this if should be unnecessary.
        // last_mut() is needed here: last() only yields a shared reference.
        if let Some(&mut Child::Node(ref mut x)) = self.children.last_mut() {
            return x;
        }
        // Required to compile, since we must return something
        unreachable!();
    }
    fn add_child_leaf(&mut self) {
        // ...
    }
}

No. You can use unreachable!() for the else case, and it's usually clear even without message/comment what's going on. The compiler is also very likely to optimize the check away.
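As a minimal sketch of that suggestion, the tree example from the question can be written with a match whose fallback arm is unreachable!(). The Node::new constructor is added here only for illustration and is not part of the question's API.

```rust
struct Node {
    children: Vec<Child>,
}

enum Child {
    Node(Node),
    #[allow(dead_code)]
    Leaf,
}

impl Node {
    // Hypothetical helper, added only to make the sketch runnable.
    fn new() -> Self {
        Node { children: Vec::new() }
    }

    fn add_child_node(&mut self, node: Node) -> &mut Node {
        self.children.push(Child::Node(node));
        // The check still exists at the language level, but the fallback
        // arm documents that it cannot be taken.
        match self.children.last_mut() {
            Some(Child::Node(child)) => child,
            _ => unreachable!("a Child::Node was pushed just above"),
        }
    }
}

fn main() {
    let mut root = Node::new();
    // Chaining builds a root with one child, which has one child itself.
    root.add_child_node(Node::new()).add_child_node(Node::new());
    println!("root has {} direct child(ren)", root.children.len());
}
```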

If the variants have the same type you can implement AsRef and use the Transport as a &str:
enum Transport {
    Car(String),
    Horse(String),
}
fn do_something<S: AsRef<str>>(x: &S) {
    println!("{}", x.as_ref());
}
impl AsRef<str> for Transport {
    fn as_ref(&self) -> &str {
        match self {
            Transport::Car(s) => s,
            Transport::Horse(s) => s,
        }
    }
}
fn main() {
    let transport = Transport::Horse("daisy".into());
    do_something(&transport)
}
Playground
Otherwise you need to use an if let binding as you are doing. There is no need for an else clause if you don't want one:
if let Transport::Horse(ref name) = Transport::Horse("daisy".into()) {
    do_something(name);
}

Alternatively, define From<Transport> for String:
…
impl From<Transport> for String {
fn from(t: Transport) -> String {
match t {
Transport::Car(value) => value.to_string(),
Transport::Horse(name) => name,
}
}
}
fn do_something(x: Transport) {
println!("{}", String::from(x));
}
fn main() {
let horse = Transport::Horse("daisy".to_string());
let car = Transport::Car(150);
do_something(horse);
do_something(car);
}

Storing an iterator for a HashMap in a struct

Edit
As the suggested solutions indicate, what I'm trying to achieve seems impossible or not the correct approach, so I'll explain the end goal here:
I am parsing the values for Foo from a YAML file using serde, and I would like to let the user get one of those stored values at a time. This is why I wanted to store an iterator in my struct.
I have two structs similar to the following:
struct Bar {
name: String,
id: u32
}
struct Foo {
my_map: HashMap<String, Bar>
}
In my Foo struct, I wish to store an iterator to my HashMap, so a user can borrow values from my map on demand.
Theoretically, the full Foo struct would look something like:
struct Foo {
my_map: HashMap<String, Bar>,
my_map_iter: HashMap<String, Bar>::iterator
}
impl Foo {
fn get_pair(&self) -> Option<(String, Bar)> {
// impl...
}
}
But I can't seem to create such a field, no matter what I try (I get various compilation errors, which suggests I'm going about it the wrong way).
I would be glad if someone can point me to the correct way to achieve that and if there is a better way to achieve what I'm trying to do - I would like to know that.
Thank you!

"I am parsing the values for Foo from a YAML file using serde"
When you parse them, you should put the values in a Vec instead of a HashMap.
I imagine your values also have names, which is why you thought a HashMap would be a good fit. You could instead store them as (name, value) pairs like so:
let mut parsed = vec![];
for _ in 0..n_to_parse {
    // The first item of the tuple is the name, the second is the value.
    let key_value = ("Get from", "serde");
    parsed.push(key_value);
}
Once the pairs are stored like this, it is easy to hand them out one at a time by keeping track of the current index:
struct ParsedHolder {
parsed: Vec<(String, String)>,
current_idx: usize,
}
impl ParsedHolder {
fn new(parsed: Vec<(String, String)>) -> Self {
ParsedHolder {
parsed,
current_idx: 0,
}
}
fn get_pair(&mut self) -> Option<&(String, String)> {
if let Some(pair) = self.parsed.get(self.current_idx) {
self.current_idx += 1;
Some(pair)
} else {
self.current_idx = 0;
None
}
}
}
This could be improved further by using a VecDeque, which lets you efficiently remove the first element of parsed and hand it out by value, so no clone is needed. The trade-off is that you can only go through the parsed values once, which I think is actually what you want in your use case.
But I'll let you implement the VecDeque version 😃
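For readers who want a starting point, here is a rough sketch of the VecDeque variant. The names ParsedQueue and take_pair are hypothetical, and the pairs are hard-coded where the real program would take them from serde.

```rust
use std::collections::VecDeque;

struct ParsedQueue {
    parsed: VecDeque<(String, String)>,
}

impl ParsedQueue {
    fn new(parsed: Vec<(String, String)>) -> Self {
        ParsedQueue { parsed: VecDeque::from(parsed) }
    }

    /// Removes and returns the next pair by value, or None once exhausted.
    fn take_pair(&mut self) -> Option<(String, String)> {
        self.parsed.pop_front()
    }
}

fn main() {
    let mut queue = ParsedQueue::new(vec![
        ("first".to_string(), "1".to_string()),
        ("second".to_string(), "2".to_string()),
    ]);
    while let Some((name, value)) = queue.take_pair() {
        println!("{} = {}", name, value);
    }
}
```

Because pop_front moves each pair out of the queue, the caller owns the strings and nothing has to be cloned.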

The reason this is hard is that unless we make sure the HashMap isn't mutated while we iterate, we could get into trouble. To make sure the HashMap stays immutable for as long as the iterator lives, borrow it:
use std::collections::HashMap;
use std::collections::hash_map::Iter;
struct Foo<'a> {
my_map: &'a HashMap<u8, u8>,
iterator: Iter<'a, u8, u8>,
}
fn main() {
let my_map = HashMap::new();
let iterator = my_map.iter();
let f = Foo {
my_map: &my_map,
iterator: iterator,
};
}
If you can make sure or know that the HashMap won't have new keys or keys removed from it (editing values with existing keys is fine) then you can do this:
struct Foo {
my_map: HashMap<String, String>,
current_idx: usize,
}
impl Foo {
fn new(my_map: HashMap<String, String>) -> Self {
Foo {
my_map,
current_idx: 0,
}
}
fn get_pair(&mut self) -> Option<(&String, &String)> {
if let Some(pair) = self.my_map.iter().skip(self.current_idx).next() {
self.current_idx += 1;
Some(pair)
} else {
self.current_idx = 0;
None
}
}
fn get_pair_cloned(&mut self) -> Option<(String, String)> {
if let Some(pair) = self.my_map.iter().skip(self.current_idx).next() {
self.current_idx += 1;
Some((pair.0.clone(), pair.1.clone()))
} else {
self.current_idx = 0;
None
}
}
}
This is fairly inefficient though, because each call has to iterate through the keys again from the start to find the next one.
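If the map does not need to stay available for lookups, another option is to let the struct own a consuming iterator instead of the map itself. The sketch below assumes exactly that; it reuses Foo and Bar from the question, while the field name remaining is made up for illustration.

```rust
use std::collections::hash_map::IntoIter;
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Bar {
    name: String,
    id: u32,
}

struct Foo {
    // Owns the entries that have not been handed out yet.
    remaining: IntoIter<String, Bar>,
}

impl Foo {
    fn new(my_map: HashMap<String, Bar>) -> Self {
        // into_iter() consumes the map, so no lifetime parameter is needed.
        Foo { remaining: my_map.into_iter() }
    }

    /// Hands out the next pair by value; the order is unspecified.
    fn get_pair(&mut self) -> Option<(String, Bar)> {
        self.remaining.next()
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert("a".to_string(), Bar { name: "first".to_string(), id: 1 });
    map.insert("b".to_string(), Bar { name: "second".to_string(), id: 2 });
    let mut foo = Foo::new(map);
    while let Some((key, bar)) = foo.get_pair() {
        println!("{} -> {:?}", key, bar);
    }
}
```

Each call is constant-time, at the cost of being able to walk the entries only once.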

Make non-Future Rust function into a Future function?

I inherited a Rust application and I wish to make a small modification to it. Presently, it retrieves records from Cassandra in the following way using Futures:
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct Item {
pub a: String,
pub b: f64,
}
#[derive(Debug)]
pub enum DataSetError {
CassandraError(cassandra_cpp::Error),
}
pub type Result<T> = std::result::Result<T, DataSetError>;
pub fn select_cass_items(
session: &Session,
a: String,
) -> impl Future<Output = Result<Vec<Item>>> + Unpin {
let table = envmnt::get_or("TABLE", "ab_table");
let mut statement = stmt!(&("SELECT a, b FROM ".to_owned() + &table + " WHERE a = ?"));
statement.bind(0, a).unwrap();
session.execute(&statement).map(|result| {
result
.map(|rows| {
rows.iter()
.map(|row| Item {
a: row.get_by_name("a").unwrap(),
b: row.get_by_name("b").unwrap(),
})
.collect()
})
.map_err(|e| {
warn!("[select_cass_items] {:?}", e);
DataSetError::CassandraError(e)
})
})
}
I want to add the option of doing the same thing but from Parquet files. I have written a simple non-Future function (below) that does the equivalent reading/filtering operation as the Cassandra function. I've verified it works as intended.
pub fn read_parquet_file (
a: String)
-> Vec<Item> {
let reader = SerializedFileReader::try_from("/path/to/file.parquet".to_string()).unwrap();
let iter = reader.get_row_iter(None).unwrap();
iter.filter_map(|row| {
if row.get_string(0).unwrap() == &a {
Some(Item {
a: row.get_string(0).unwrap().to_string(),
b: row.get_double(1).unwrap(),
})
}
else {
None
}
}).collect::<Vec<_>>()
}
The question is: how do I convert the non-Future Parquet function to be a drop-in replacement for the Future Cassandra function? I see that the cassandra_cpp crate supports Futures, but the parquet crate does not. Surely there must be a way to do this? However, I'm a Rust newbie, and I can't find any examples close enough to what I want to be able to mogrify my work into what I need. I've tried various things but they've all been dead ends, and aren't worth sharing.
Thank you!
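One possible direction, sketched with the standard library only: wrap the synchronous result in std::future::ready, which yields a future that is already complete and is Unpin. Everything below is an assumption made for illustration: read_items_sync is a hard-coded stand-in for the Parquet reader, the error type is simplified to String, and poll_once exists only so the sketch runs without an async runtime.

```rust
use std::future::{self, Future};
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

#[derive(Debug, Clone, PartialEq)]
pub struct Item {
    pub a: String,
    pub b: f64,
}

/// Hypothetical stand-in for the synchronous Parquet reader.
fn read_items_sync(a: &str) -> Vec<Item> {
    let rows = [("x", 1.0), ("y", 2.0), ("x", 3.0)];
    rows.iter()
        .filter(|row| row.0 == a)
        .map(|row| Item { a: row.0.to_string(), b: row.1 })
        .collect()
}

/// Same shape as the Cassandra function: the work is done eagerly and the
/// returned future is already complete.
pub fn select_items(a: String) -> impl Future<Output = Result<Vec<Item>, String>> + Unpin {
    future::ready(Ok(read_items_sync(&a)))
}

/// Waker that does nothing; only needed to poll without a runtime.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Polls a future a single time and returns its output if it is ready.
fn poll_once<F: Future + Unpin>(mut fut: F) -> Option<F::Output> {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    match Pin::new(&mut fut).poll(&mut cx) {
        Poll::Ready(value) => Some(value),
        Poll::Pending => None,
    }
}

fn main() {
    let items = poll_once(select_items("x".to_string()));
    println!("{:?}", items);
}
```

Note that this only adapts the signature. The file is still read synchronously on the calling thread, so inside an async runtime a dedicated blocking facility (for example tokio::task::spawn_blocking, if tokio is the runtime in use) may be the better fit.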

Rust: concurrency error, program hangs after first thread

I have created a simplified version of my problem below. I have a Bag struct and an Item struct. I want to spawn 10 threads that each execute the item_action method from Bag on one item in an item_list, and print a statement if both of the item's attributes are in the bag's attributes.
use std::sync::{Mutex,Arc};
use std::thread;
#[derive(Clone, Debug)]
struct Bag{
attributes: Arc<Mutex<Vec<usize>>>
}
impl Bag {
fn new(n: usize) -> Self {
let mut v = Vec::with_capacity(n);
for _ in 0..n {
v.push(0);
}
Bag{
attributes:Arc::new(Mutex::new(v)),
}
}
fn item_action(&self, item_attr1: usize, item_attr2: usize) -> Result<(),()> {
if self.attributes.lock().unwrap().contains(&item_attr1) ||
self.attributes.lock().unwrap().contains(&item_attr2) {
println!("Item attributes {} and {} are in Bag attribute list!", item_attr1, item_attr2);
Ok(())
} else {
Err(())
}
}
}
#[derive(Clone, Debug)]
struct Item{
item_attr1: usize,
item_attr2: usize,
}
impl Item{
pub fn new(item_attr1: usize, item_attr2: usize) -> Self {
Item{
item_attr1: item_attr1,
item_attr2: item_attr2
}
}
}
fn main() {
let mut item_list: Vec<Item> = Vec::new();
for i in 0..10 {
item_list.push(Item::new(i, (i+1)%10));
}
let bag: Bag= Bag::new(10); //create 10 attributes
let mut handles = Vec::with_capacity(10);
for x in 0..10 {
let bag2 = bag.clone();
let item_list2= item_list.clone();
handles.push(
thread::spawn(move || {
bag2.item_action(item_list2[x].item_attr1, item_list2[x].item_attr2);
})
)
}
for h in handles {
println!("Here");
h.join().unwrap();
}
}
When running, I only got one line, and the program just stops there without returning.
Item attributes 0 and 1 are in Bag attribute list!
May I know what went wrong? Please see code in Playground
Updated:
With suggestion from #loganfsmyth, the program can return now... but still only prints 1 line as above. I expect it to print 10 because my item_list has 10 items. Not sure if my thread logic is correct.
I have added println!("Here"); when calling join all threads. And I can see Here is printed 10 times, just not the actual log from item_action

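I believe this is because the first lock guard is still alive when the second lock is taken in your
if self.attributes.lock().unwrap().contains(&item_attr1) ||
    self.attributes.lock().unwrap().contains(&item_attr2) {
expression. The guard returned by the first lock() call is a temporary, and temporaries created in an if condition are only dropped at the end of the whole condition. When the first contains() returns false, the second lock() runs while the first guard is still held, so you essentially end up with
let condition = {
    let lock1 = self.attributes.lock().unwrap();
    lock1.contains(&item_attr1) || {
        let lock2 = self.attributes.lock().unwrap();
        lock2.contains(&item_attr2)
    }
};
if condition {
which is causing your code to deadlock, because a Mutex cannot be locked twice from the same thread. This also explains why only the first item printed: its first check succeeded, so the second lock was never attempted.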
You should instead write:
let attributes = self.attributes.lock().unwrap();
if attributes.contains(&item_attr1) ||
attributes.contains(&item_attr2) {
so that there is only one lock.
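Your code would also work as-is with a reentrant mutex (for example ReentrantMutex from the parking_lot crate), which lets the same thread lock it more than once. An RwLock is not a safe substitute here, since the standard library does not guarantee that taking a second read lock on the same thread will succeed.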
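A compact, runnable sketch of the single-lock version follows. It is a simplification of the question's program: items are plain tuples, item_action returns a bool instead of printing, and the helper count_matches is hypothetical.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Clone)]
struct Bag {
    attributes: Arc<Mutex<Vec<usize>>>,
}

impl Bag {
    fn new(attributes: Vec<usize>) -> Self {
        Bag { attributes: Arc::new(Mutex::new(attributes)) }
    }

    fn item_action(&self, attr1: usize, attr2: usize) -> bool {
        // Lock once and reuse the guard for both checks.
        let attributes = self.attributes.lock().unwrap();
        attributes.contains(&attr1) || attributes.contains(&attr2)
    }
}

/// Spawns one thread per item and counts how many items matched.
fn count_matches(bag: &Bag, items: &[(usize, usize)]) -> usize {
    let handles: Vec<_> = items
        .iter()
        .map(|&(a, b)| {
            let bag = bag.clone();
            thread::spawn(move || bag.item_action(a, b))
        })
        .collect();
    handles
        .into_iter()
        .map(|handle| handle.join().unwrap())
        .filter(|&hit| hit)
        .count()
}

fn main() {
    // Ten attributes, all zero, as in the question.
    let bag = Bag::new(vec![0; 10]);
    let items: Vec<(usize, usize)> = (0..10).map(|i| (i, (i + 1) % 10)).collect();
    println!("{} item(s) matched", count_matches(&bag, &items));
}
```

With every bag attribute set to 0, only the items that contain a 0 match, so the program finishing with a small count is expected rather than a sign of a remaining deadlock.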

"Registering" trait implementations + factory method for trait objects

Say we want to switch object implementations at runtime. We'd do something like this:
pub trait Methods {
fn func(&self);
}
pub struct Methods_0;
impl Methods for Methods_0 {
fn func(&self) {
println!("foo");
}
}
pub struct Methods_1;
impl Methods for Methods_1 {
fn func(&self) {
println!("bar");
}
}
pub struct Object<'a> {
    methods: &'a (Methods + 'a),
}
fn main() {
    let methods: [&Methods; 2] = [&Methods_0, &Methods_1];
    let mut obj = Object { methods: methods[0] };
    obj.methods.func();
    obj.methods = methods[1];
    obj.methods.func();
}
Now, what if there are hundreds of such implementations? E.g. imagine implementations of cards for collectible card game where every card does something completely different and is hard to generalize; or imagine implementations for opcodes for a huge state machine. Sure you can argue that a different design pattern can be used -- but that's not the point of this question...
Wonder if there is any way for these Impl structs to somehow "register" themselves so they can be looked up later by a factory method? I would be happy to end up with a magical macro or even a plugin to accomplish that.
Say, in D you can use templates to register the implementations -- and if you can't for some reason, you can always inspect modules at compile-time and generate new code via mixins; there are also user-defined attributes that can help in this. In Python, you would normally use a metaclass so that every time a new child class is created, a ref to it is stored in the metaclass's registry which allows you to look up implementations by name or parameter; this can also be done via decorators if implementations are simple functions.
Ideally, in the example above you would be able to create Object as
Object::new(0)
where the value 0 is only known at runtime and it would magically return you an Object { methods: &Methods_0 }, and the body of new() would not have the implementations hard-coded like so "methods: [&Methods; 2] = [&Methods_0, &Methods_1]", instead it should be somehow inferred automatically.

So, this is probably extremely buggy, but it works as a proof of concept.
It is possible to use Cargo's code generation support to make the introspection at compile-time, by parsing (not exactly parsing in this case, but you get the idea) the present implementations, and generating the boilerplate necessary to make Object::new() work.
The code is pretty convoluted and has no error handling whatsoever, but works.
Tested on rustc 1.0.0-dev (2c0535421 2015-02-05 15:22:48 +0000)
(See on github)
src/main.rs:
pub mod implementations;
mod generated_glue {
include!(concat!(env!("OUT_DIR"), "/generated_glue.rs"));
}
use generated_glue::Object;
pub trait Methods {
fn func(&self);
}
pub struct Methods_2;
impl Methods for Methods_2 {
fn func(&self) {
println!("baz");
}
}
fn main() {
Object::new(2).func();
}
src/implementations.rs:
use super::Methods;
pub struct Methods_0;
impl Methods for Methods_0 {
fn func(&self) {
println!("foo");
}
}
pub struct Methods_1;
impl Methods for Methods_1 {
fn func(&self) {
println!("bar");
}
}
build.rs:
#![feature(core, unicode, path, io, env)]
use std::env;
use std::old_io::{fs, File, BufferedReader};
use std::collections::HashMap;
fn main() {
let target_dir = Path::new(env::var_string("OUT_DIR").unwrap());
let mut target_file = File::create(&target_dir.join("generated_glue.rs")).unwrap();
let source_code_path = Path::new(file!()).join_many(&["..", "src/"]);
let source_files = fs::readdir(&source_code_path).unwrap().into_iter()
.filter(|path| {
match path.str_components().last() {
Some(Some(filename)) => filename.split('.').last() == Some("rs"),
_ => false
}
});
let mut implementations = HashMap::new();
for source_file_path in source_files {
let relative_path = source_file_path.path_relative_from(&source_code_path).unwrap();
let source_file_name = relative_path.as_str().unwrap();
implementations.insert(source_file_name.to_string(), vec![]);
let mut file_implementations = &mut implementations[*source_file_name];
let mut source_file = BufferedReader::new(File::open(&source_file_path).unwrap());
for line in source_file.lines() {
let line_str = match line {
Ok(line_str) => line_str,
Err(_) => break,
};
if line_str.starts_with("impl Methods for Methods_") {
const PREFIX_LEN: usize = 25;
let number_len = line_str[PREFIX_LEN..].chars().take_while(|chr| {
chr.is_digit(10)
}).count();
let number: i32 = line_str[PREFIX_LEN..(PREFIX_LEN + number_len)].parse().unwrap();
file_implementations.push(number);
}
}
}
writeln!(&mut target_file, "use super::Methods;").unwrap();
for (source_file_name, impls) in &implementations {
let module_name = match source_file_name.split('.').next() {
Some("main") => "super",
Some(name) => name,
None => panic!(),
};
for impl_number in impls {
writeln!(&mut target_file, "use {}::Methods_{};", module_name, impl_number).unwrap();
}
}
let all_impls = implementations.values().flat_map(|impls| impls.iter());
writeln!(&mut target_file, "
pub struct Object;
impl Object {{
pub fn new(impl_number: i32) -> Box<Methods + 'static> {{
match impl_number {{
").unwrap();
for impl_number in all_impls {
writeln!(&mut target_file,
" {} => Box::new(Methods_{}),", impl_number, impl_number).unwrap();
}
writeln!(&mut target_file, "
_ => panic!(\"Unknown impl number: {{}}\", impl_number),
}}
}}
}}").unwrap();
}
The generated code:
use super::Methods;
use super::Methods_2;
use implementations::Methods_0;
use implementations::Methods_1;
pub struct Object;
impl Object {
pub fn new(impl_number: i32) -> Box<Methods + 'static> {
match impl_number {
2 => Box::new(Methods_2),
0 => Box::new(Methods_0),
1 => Box::new(Methods_1),
_ => panic!("Unknown impl number: {}", impl_number),
}
}
}
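For comparison, here is a much simpler sketch in current Rust syntax (dyn Trait) that looks implementations up in a runtime table. It does not solve the automatic part of the question: every implementation is still registered by hand in default_registry. The names Registry, Constructor, default_registry, Methods0 and Methods1 are all hypothetical, and func returns a string instead of printing so the result can be checked.

```rust
use std::collections::HashMap;

pub trait Methods {
    fn func(&self) -> &'static str;
}

struct Methods0;
impl Methods for Methods0 {
    fn func(&self) -> &'static str {
        "foo"
    }
}

struct Methods1;
impl Methods for Methods1 {
    fn func(&self) -> &'static str {
        "bar"
    }
}

/// A constructor is a plain function pointer returning a boxed trait object.
type Constructor = fn() -> Box<dyn Methods>;

fn make_methods0() -> Box<dyn Methods> {
    Box::new(Methods0)
}

fn make_methods1() -> Box<dyn Methods> {
    Box::new(Methods1)
}

pub struct Registry {
    constructors: HashMap<u32, Constructor>,
}

impl Registry {
    fn new() -> Self {
        Registry { constructors: HashMap::new() }
    }

    fn register(&mut self, id: u32, constructor: Constructor) {
        self.constructors.insert(id, constructor);
    }

    /// Builds the implementation registered under `id`, if there is one.
    fn create(&self, id: u32) -> Option<Box<dyn Methods>> {
        self.constructors.get(&id).map(|constructor| constructor())
    }
}

/// Manual registration: this is the part a build script or macro would generate.
fn default_registry() -> Registry {
    let mut registry = Registry::new();
    registry.register(0, make_methods0);
    registry.register(1, make_methods1);
    registry
}

fn main() {
    let registry = default_registry();
    let id = 1; // stands in for a value only known at runtime
    match registry.create(id) {
        Some(methods) => println!("{}", methods.func()),
        None => println!("unknown implementation {}", id),
    }
}
```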
