How do I save structured data to file?


My program contains a huge precomputation whose output is constant. I would like to avoid repeating this precomputation on subsequent runs. Thus, I'd like to save its output to a file on the first run of the program and just load it on later runs.
The output contains uncommon data types: objects and structs I defined myself.
How do I go about doing that?

The de facto standard for (de)serializing Rust objects is serde. Given a Rust struct (or enum), it produces an intermediate representation, which can then be converted to the desired format (e.g. JSON). Given a struct:
use serde::{Serialize, Deserialize};
// here the "magic" conversion is generated
#[derive(Debug, Serialize, Deserialize)]
struct T {
    i: i32,
    f: f64,
}
you can get a JSON representation with a simple one-liner:
let t = T { i: 1, f: 321.456 };
println!("{}", serde_json::to_string(&t).unwrap());
// prints `{"i":1,"f":321.456}`
as well as converting back:
let t: T = serde_json::from_str(r#"{"i":1,"f":321.456}"#).unwrap();
println!("i: {}, f: {}", t.i, t.f);
// prints i: 1, f: 321.456
This is an example of serde_json usage, but you may find other libraries more suitable, such as serde_cbor, serde_yaml, bincode, serde_xml and many others.

You would want to use something like serde to serialize the data, save it to disk, and then restore it from there on the next run. In particular, bincode is useful for serializing the data in a binary format, which takes much less space than JSON or other human-readable formats. However, you have to be careful not to load old serialized data after you have changed the layout of the structures in your program.
For more detail, see Serde's documentation, but the basic idea is to mark every structure that needs saving with #[derive(Serialize, Deserialize)] and use bincode for the serialization and deserialization. I wrote a playground example using serde_json (bincode is not available in the Rust playground); bincode works the same way, except that it uses serialize and deserialize instead of to_vec and from_slice.
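Neither answer shows the file handling itself, so here is a minimal sketch of the "compute once, load afterwards" flow, including a guard against stale data. Everything in it is an assumption for illustration: `Table`, `CACHE_VERSION`, `load_or_compute`, and the hand-written `encode`/`decode` pair, which only stands in for a real serializer (with serde you would call something like serde_json::to_vec and serde_json::from_slice in those two places).

```rust
use std::fs;
use std::io;
use std::path::Path;

// Hypothetical format version; bump it whenever the struct layout changes.
const CACHE_VERSION: u8 = 1;

// Hypothetical result of the expensive precomputation.
#[derive(Debug, PartialEq)]
struct Table {
    values: Vec<u32>,
}

// Stand-in for a real serializer: each value as 4 little-endian bytes.
fn encode(table: &Table) -> Vec<u8> {
    table.values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

// Stand-in for a real deserializer; `None` means the bytes are unusable.
fn decode(bytes: &[u8]) -> Option<Table> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    let values = bytes
        .chunks_exact(4)
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Some(Table { values })
}

// Loads the cached table if the file exists, carries the current version
// byte, and decodes cleanly; otherwise recomputes and rewrites the file.
fn load_or_compute(path: &Path, compute: impl FnOnce() -> Table) -> io::Result<Table> {
    if let Ok(bytes) = fs::read(path) {
        if let Some((&version, payload)) = bytes.split_first() {
            if version == CACHE_VERSION {
                if let Some(table) = decode(payload) {
                    return Ok(table);
                }
            }
        }
    }
    let table = compute();
    let mut bytes = vec![CACHE_VERSION];
    bytes.extend(encode(&table));
    fs::write(path, bytes)?;
    Ok(table)
}
```

A caller would write something like `load_or_compute(Path::new("table.cache"), expensive_precomputation)`. The leading version byte is one simple way to avoid reading data written with an older struct layout; it only helps if you remember to bump it.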

Related

What should an idiomatic `Display` implementation write?

The trait documentation says
Display is similar to Debug, but Display is for user-facing output, and so cannot be derived.
But what does that mean? Should it write the full string-encoded value even if that results in a 500 character output? Should it make a nice and friendly representation suitable for display in a user interface even if that results in to_string() not actually returning the full value as a string?
Let me illustrate:
Say I have a type that represents important data in my application. This data has a canonical string-encoding with a very specific format.
pub struct BusinessObject {
pub name: String,
pub reference: u32,
}
First, I want to implement Display so I can use it for making easily readable log messages:
use std::fmt::Display;
impl Display for BusinessObject {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        // Formats like `Shampoglak (7384)`
        write!(f, "{} ({})", self.name, self.reference)
    }
}
Now, let's implement a method that returns the canonical standard string format for BusinessObject instances. As the as_str() method name is idiomatically only used when returning a string slice and that is not possible in this case, one could think that the most straightforward approach would be to implement a method to_string() for this.
impl BusinessObject {
    fn to_string(&self) -> String {
        // Formats like `Shampoglak00007384`
        format!("{}{:0>8}", self.name, self.reference)
    }
}
But no! This method name is already implemented as part of the automatic ToString trait implementation that we have because we implemented Display.
What does an idiomatic implementation of Display write? A full representation of a value as a string or a friendly, human-readable representation of it? How should I structure my code and name my methods if I need to implement both of those? I am specifically looking for a solution that can be applied generically and not just in this specific situation. I don't want to have to look up what the behavior of to_string() for a given struct is before I use it.
I didn't find anything in the documentation of associated traits and various Rust books and resources I looked into.

What does an idiomatic implementation of Display write? A full representation of a value as a string or a friendly, human-readable representation of it?
The latter: Display should produce a friendly, human-readable representation.
How should I structure my code and name my methods if I need to implement both of those?
"Full representations" of values as strings would more correctly be known as a string serialisation of the value. A method fn into_serialised_string(self) -> String would be one approach, but perhaps you want to consider a serialisation library like serde that separates the process of serialising (and deserialising) from the serialised format?

What does an idiomatic implementation of Display write?
It writes the obvious string form of that data. What exactly that will be depends on the data, and it may not exist for some types.
Such a string representation is not necessarily “human friendly”. For example, the Display implementation of serde_json::Value (a type which represents arbitrary JSON data structures) produces the JSON text corresponding to the value, by default without any whitespace for readability — because that's the conventional string representation.
Display should be implemented for types where you can say there is “the string representation of the data” — where there is one obvious choice. In my opinion, it should not be implemented for types where there isn't one obvious choice — instead, the API should omit it, in order to guide users of the type to think about which representation they want, rather than giving them a potentially bad default.
A full representation of a value as a string or a friendly, human-readable representation of it?
In my opinion, a Display implementation which truncates the data is incorrect. Display is for the string form of the data, not a name or snippet of the data.
How should I structure my code and name my methods if I need to implement both of those?
For convenient use in format strings, you can write one or more methods which return wrapper types that implement Display (like Path::display() in std).
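To make the wrapper idea concrete, here is a minimal sketch applied to the question's BusinessObject. The wrapper types `Friendly` and `Canonical` and the methods `friendly()` and `canonical()` are hypothetical names chosen for this example. In line with the opinion above, BusinessObject itself does not implement Display here, so callers have to pick a representation.

```rust
use std::fmt;

pub struct BusinessObject {
    pub name: String,
    pub reference: u32,
}

// Hypothetical wrapper for log-friendly output, e.g. `Shampoglak (7384)`.
pub struct Friendly<'a>(&'a BusinessObject);

// Hypothetical wrapper for the canonical encoding, e.g. `Shampoglak00007384`.
pub struct Canonical<'a>(&'a BusinessObject);

impl fmt::Display for Friendly<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} ({})", self.0.name, self.0.reference)
    }
}

impl fmt::Display for Canonical<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Reference is right-aligned and zero-padded to 8 characters.
        write!(f, "{}{:0>8}", self.0.name, self.0.reference)
    }
}

impl BusinessObject {
    // Borrowing wrappers in the style of `Path::display()`.
    pub fn friendly(&self) -> Friendly<'_> {
        Friendly(self)
    }

    pub fn canonical(&self) -> Canonical<'_> {
        Canonical(self)
    }
}
```

Both wrappers work directly in format strings, for example `println!("{}", obj.friendly())`, and `obj.canonical().to_string()` yields the canonical String without shadowing the blanket ToString implementation.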

Can I rely on the serialization of two identical structures being identical?

Consider the following code:
use serde::{Serialize, Deserialize};
#[derive(Serialize, Deserialize)]
struct MyStruct {
    a: u32,
    b: u32
}
#[derive(Serialize, Deserialize)]
#[serde(rename = "MyStruct")]
struct AlsoMyStruct {
    a: u32,
    b: u32
}
I am wondering if I can safely do something like:
let ser = any_encoding::serialize(&MyStruct{a: 33, b: 44}).unwrap();
let deser: AlsoMyStruct = any_encoding::deserialize(&ser).unwrap();
where any_encoding is, e.g., bincode, json, or any other Serde-supported encoding. In my head, this should work nicely: the two structures have the same name (I'm explicitly renaming AlsoMyStruct into "MyStruct") and exactly the same fields: same field names, same field types, same order of fields.
However, I am wondering: is this actually guaranteed to work? Or is there some other, corner-case, maybe platform-dependent, unforeseen piece of information that a Serde serializer/deserializer might include in the representation of MyStruct / AlsoMyStruct that could lead to the two representations being incompatible?

In general, no, you cannot expect this to work. The reason is that neither serde nor any de/serializers guarantee that you can round-trip your data (source). This means you cannot even expect this to work in all cases if you use the same struct in both places.
For example, JSON cannot round-trip Option<()>, and formats which are not self-describing, like bincode, do not support untagged enums.
Nothing in the type signatures enforces round-tripping.
Here are some reasons why deserialization might fail:
Using skip_serializing_none with formats that are not self-describing (serde #1732).
Anything which calls deserialize_any, such as untagged, adjacently tagged, or internally tagged enums (serde #1762).
Borrowing during deserialization, e.g., for &'de str or &'de [u8]. serde_json only supports &'de str if there are no escape sequences and never supports &'de [u8].
Some formats cannot serialize some types, e.g., JSON does not support structs as map keys and bincode only supports sequences of known lengths (bincode #167).
A type only implements one of the traits (Serialize/Deserialize) or the implementations do not match, e.g., serialize as number but deserialize as string.
That being said, this can work under some circumstances. The structs should have the same name and the fields in the same order. The types or rather the Serialize/Deserialize implementations also need to support round-tripping. With Option<()> from above it also depends on the Serializer/Deserializer implementations if you can round-trip them, even if Serialize/Deserialize implementations do support it.
Many types do try to support round-tripping, since that is what most commonly is expected.

How can I sort fields in alphabetic order when serializing with serde?

I have an API that requires the object's fields to be sorted alphabetically because the struct has to be hashed.
In Java/Jackson, you can set a flag in the serializer: MapperFeature.SORT_PROPERTIES_ALPHABETICALLY. I can't find anything similar in Serde.
I'm using rmp-serde (MessagePack). It follows the annotations and serialization process used for JSON, so I thought that it would be fully compatible, but the sorting provided by @jonasbb doesn't work for it.
The struct has (a lot of) nested enums and structs which have to be flattened for the final representation. I'm using Serialize::serialize for that, but calling state.serialize_field at the right place (such that everything is alphabetical) is a pain because the enums need a match clause, so it has to be called multiple times for the same field in different places and the code is very difficult to follow.
As possible solutions, two ideas:
Create a new struct with the flat representation and sort the fields alphabetically manually.
This is a bit error prone, so a programmatic sorting solution for this flattened struct would be great.
Buffer the key values in Serialize::serialize (e.g. in a BTreeMap, which is sorted), and call state.serialize_field in a loop at the end.
The problem is that the values seem to have to be of type Serialize, which isn't object safe, so I wasn't able to figure out how to store them in the map.
How to sort HashMap keys when serializing with serde? is similar but not related because my question is about the sorting of the struct's fields/properties.

You do not say which data format you are targeting. This makes it hard to find a solution, since some solutions might not work in all cases.
This code works if you are using JSON (unless the preserve_order feature flag is used). The same would work for TOML by serializing into toml::Value as an intermediate step.
The solution will also work for other data formats, but it might result in a different serialization, for example, emitting the data as a map instead of struct-like.
use serde::{Deserialize, Serialize};
// Round-trips the value through serde_json::Value, whose map type iterates its keys in sorted order.
fn sort_alphabetically<T: Serialize, S: serde::Serializer>(value: &T, serializer: S) -> Result<S::Ok, S::Error> {
    let value = serde_json::to_value(value).map_err(serde::ser::Error::custom)?;
    value.serialize(serializer)
}
#[derive(Serialize)]
struct SortAlphabetically<T: Serialize>(
    #[serde(serialize_with = "sort_alphabetically")]
    T
);
#[derive(Serialize, Deserialize, Default, Debug)]
struct Foo {
    z: (),
    bar: (),
    ZZZ: (),
    aAa: (),
    AaA: (),
}
println!("{}", serde_json::to_string_pretty(&SortAlphabetically(&Foo::default()))?);
because the struct has to be hashed
While field order is one source of nondeterminism, there are other factors too. Many formats allow different amounts of whitespace or different representations of the same value, such as the Unicode escape \u0066 in place of the letter f.
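One detail worth checking before hashing is what "alphabetical" means in practice. Without the preserve_order feature, serde_json's map type is backed by a BTreeMap with String keys, so keys are ordered byte-wise and uppercase ASCII letters sort before lowercase ones. The sketch below uses only the standard library to show that ordering for the field names of Foo; `sorted_field_names` is a hypothetical helper, not part of any crate.

```rust
use std::collections::BTreeMap;

// Returns the names in the order a `BTreeMap` iterates over its keys.
fn sorted_field_names<'a>(names: &[&'a str]) -> Vec<&'a str> {
    // The values are irrelevant; only the key order is being illustrated.
    let map: BTreeMap<&'a str, ()> = names.iter().map(|&name| (name, ())).collect();
    map.into_keys().collect()
}
```

For `["z", "bar", "ZZZ", "aAa", "AaA"]` this yields `AaA`, `ZZZ`, `aAa`, `bar`, `z`. If the API you are hashing for expects a case-insensitive or locale-aware order, this will not match it, and you would need to sort with your own comparison instead.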

Serialize and Deserialize from Vec<u32>

I have a struct of 5 u32s that implements serialize/deserialize by simply serializing: (s.first, s.second, s.third, s.fourth, s.fifth).
However, this needs to be packed into and unpacked from a flat buffer of Vec<u32> or an Option<Vec<u32>> that represents the data: essentially, every 5 u32s form a new struct. I keep struggling with the visitor implementation. Is there an easy way to do this while sharing code between the Option and non-Option cases?
I really want to do impl Serialize for Vec<MyType> (and Deserialize) but that doesn't work.

I ended up abandoning my Serialize and Deserialize impls and went with #[serde(with = "my_mod")] for the Vec<MyType> case.
For the Option<Vec<MyType>> case, I ended up creating wrapper types that inverted the relationship, so that what I was really serializing/deserializing was Option<Wrapper { Vec<T> }>.
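The answer does not show its my_mod, so the following is only a sketch of the conversion logic such a module could delegate to, written with the standard library alone. `MyType` follows the field names in the question; `pack`, `unpack`, `pack_opt`, and `unpack_opt` are hypothetical helpers. The Option variants reuse the plain ones through Option combinators, which is one way to share code between the two cases.

```rust
// Struct from the question: five u32 fields, flattened in declaration order.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MyType {
    first: u32,
    second: u32,
    third: u32,
    fourth: u32,
    fifth: u32,
}

// Flattens the structs into one buffer, five values per struct.
fn pack(items: &[MyType]) -> Vec<u32> {
    items
        .iter()
        .flat_map(|s| [s.first, s.second, s.third, s.fourth, s.fifth])
        .collect()
}

// Rebuilds the structs; fails if the length is not a multiple of 5.
fn unpack(flat: &[u32]) -> Result<Vec<MyType>, String> {
    if flat.len() % 5 != 0 {
        return Err(format!("length {} is not a multiple of 5", flat.len()));
    }
    Ok(flat
        .chunks_exact(5)
        .map(|c| MyType {
            first: c[0],
            second: c[1],
            third: c[2],
            fourth: c[3],
            fifth: c[4],
        })
        .collect())
}

// Optional variants: `None` passes through, `Some` reuses the functions above.
fn pack_opt(items: Option<&[MyType]>) -> Option<Vec<u32>> {
    items.map(pack)
}

fn unpack_opt(flat: Option<&[u32]>) -> Result<Option<Vec<MyType>>, String> {
    flat.map(unpack).transpose()
}
```

In a `with` module, the serialize function could call `pack` and serialize the resulting Vec<u32>, and the deserialize function could deserialize a Vec<u32> and pass it to `unpack`, mapping the error string into the deserializer's error type.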

Is it possible to implement ASN.1 DER in Rust using Serde?

AFAIK there's no (flexible and stable) ASN.1 {ser,deser}ialization library in Rust, so I'm looking into making one (while learning Rust at the same time). My goal is SNMP (v1-v3) client implementation in Rust.
Before starting from scratch, I'd like to ask the Serde team or experienced Serde users whether it's possible to implement an ASN.1 codec with Serde. The problem is that every object in ASN.1 has its own header (TAG + LENGTH), where TAG is user-defined for each type, so iXX or uXX or bytes or whatever can have any TAG.
An ASN.1 object is composed of tag, length and payload. ASN.1 has a set of universal (default) tags for integers, floats, bytestrings (as well as ASCII strings) etc. I could just stick to universal tags for primitive types, but for non-primitive types (tuples, newtypes, structs etc) the type should have an implementation of the Asn1Info trait, providing tag and custom serialize / deserialize functionality.
{ser,deser}ialization of primitive types is trivial, but how can I implement it for complex structures (or newtypes)? They must be Asn1Info.
I've looked into the asn1-cereal library. It looks like a decent ASN.1 implementation, providing useful macros and stuff. I might as well work on it instead of writing everything from scratch.
Let's assume tag is u8 and Asn1Info trait looks like this:
pub trait Asn1Info {
    fn asn1_tag() -> u8;
}
Then I have a newtype like pub struct Counter(u32) with its own application-specific tag. I'd then make an impl for Counter like this:
impl Asn1Info for Counter {
    fn asn1_tag() -> u8 {
        0x41
    }
}
Now how do I serialize it with tag 0x41 without manually implementing Serialize trait? There's no way to inject additional information to Serializer, so I'm unable to reuse all non-primitive serialization methods in it (like serialize_newtype_variant).
If I can't use Serializer methods in the Serialize trait impl for custom ASN.1 objects (application-specific, context-specific etc.), then there's no way (or no point) to implement a useful ASN.1 codec with Serde, is there?
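For reference, this is roughly what the target bytes look like when the tag comes from Asn1Info. The sketch below does not use Serde and does not settle whether a Serializer can carry the tag; it only illustrates the tag, length, payload layout described above for a u32 newtype. `integer_content` and `encode_tagged` are hypothetical helpers, and the sketch assumes single-byte tags and short-form lengths only.

```rust
pub trait Asn1Info {
    fn asn1_tag() -> u8;
}

pub struct Counter(pub u32);

impl Asn1Info for Counter {
    fn asn1_tag() -> u8 {
        0x41
    }
}

// Minimal big-endian content octets for a non-negative integer.
fn integer_content(value: u32) -> Vec<u8> {
    let bytes = value.to_be_bytes();
    // Skip leading zero bytes, but always keep at least one byte.
    let first = bytes
        .iter()
        .position(|&b| b != 0)
        .unwrap_or(bytes.len() - 1);
    let mut content = Vec::with_capacity(5);
    // A set high bit would be read as a negative number, so prepend 0x00.
    if bytes[first] & 0x80 != 0 {
        content.push(0x00);
    }
    content.extend_from_slice(&bytes[first..]);
    content
}

// Emits tag, short-form length, then payload for any tagged u32 newtype.
fn encode_tagged<T: Asn1Info>(value: u32) -> Vec<u8> {
    let content = integer_content(value);
    // The content is at most 5 bytes, so the short length form suffices.
    let mut out = vec![T::asn1_tag(), content.len() as u8];
    out.extend(content);
    out
}
```

With these definitions, `encode_tagged::<Counter>(300)` produces the bytes 0x41 0x02 0x01 0x2C. The tag is picked through the type parameter, which is exactly the piece of information that a derived Serialize impl has no obvious way to hand to a Serializer.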
