Creating Datafusion's Dataframe from Vec<Struct> in Rust?

I am trying to do something similar to this question, but instead of using the polars library I would like to use the DataFusion library.
The idea is to go from a Vec of structs like this:
#[derive(Serialize)]
struct Test {
    id: u32,
    amount: u32,
}
and save to Parquet files, just like in the question I referenced.
While this was possible with polars (the accepted answer serializes the structs to JSON and then builds the DataFrame from that), I could not find a similar approach using DataFusion.
Any suggestions will be appreciated.

I think the parquet_derive crate is designed for exactly this use case: writing Rust structs to and reading them from Parquet files. DataFusion would be useful if you wanted to process the resulting data, for example filtering or aggregating it with SQL.
Here is an example in the docs: https://docs.rs/parquet_derive/30.0.1/parquet_derive/derive.ParquetRecordWriter.html
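If the data is later handed to DataFusion for processing, keep in mind that it works on columnar (Arrow) data, so a Vec of row structs has to be pivoted into one vector per field at some point. Below is a minimal sketch of that pivot using only the standard library. The Columns type and the to_columns function are hypothetical names, and the actual Arrow, DataFusion, and Parquet calls are deliberately left out.

```rust
/// Row-oriented record, as in the question (the Serialize derive is omitted here).
#[derive(Debug, Clone, PartialEq)]
pub struct Test {
    pub id: u32,
    pub amount: u32,
}

/// Hypothetical column-oriented layout: one Vec per struct field.
#[derive(Debug, Default, PartialEq)]
pub struct Columns {
    pub id: Vec<u32>,
    pub amount: Vec<u32>,
}

/// Pivot a slice of rows into per-field columns, preserving row order.
pub fn to_columns(rows: &[Test]) -> Columns {
    let mut cols = Columns {
        id: Vec::with_capacity(rows.len()),
        amount: Vec::with_capacity(rows.len()),
    };
    for row in rows {
        cols.id.push(row.id);
        cols.amount.push(row.amount);
    }
    cols
}
```

Each resulting vector corresponds to one column of the eventual table, which is the shape a columnar writer or query engine expects.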

Related

Serialization in Haskell

From a bird's-eye view, my question is: Is there a universal mechanism for as-is data serialization in Haskell?
Introduction
The problem did not actually originate in Haskell. I once tried to serialize a Python dictionary whose objects had a rather expensive hash function. I found that Python's default dictionary serialization does not save the internal structure of the dictionary but just dumps a list of key-value pairs. As a result, deserialization is time-consuming (every key has to be rehashed), and there is no way around it. I was certain that there would be a way in Haskell because, at first glance, there should be no problem converting a pure Haskell value to a byte stream automatically using BFS or DFS. Surprisingly, there is not. This problem was discussed here (citation below):
Currently, there is no way to make HashMap serializable without modifying the HashMap library itself. It is not possible to make Data.HashMap an instance of Generic (for use with cereal) using stand-alone deriving as described by @mergeconflict's answer, because Data.HashMap does not export all its constructors (this is a requirement for GHC). So, the only solution left to serialize the HashMap seems to be to use the toList/fromList interface.
Current Problem
I have much the same problem with Data.Trie from the bytestring-trie package. Building a trie for my data is very time-consuming, and I need a mechanism to serialize and deserialize this trie. However, as in the previous case, I see no way to make Data.Trie an instance of Generic (or am I wrong?).
So the questions are:
Is there some kind of universal mechanism to project a pure Haskell type to a byte string? If not, is it a fundamental restriction or just a lack of implementations?
If not, what is the most painless way to modify the bytestring-trie package to make it an instance of Generic and serialize it with Data.Store?

There is a way using compact regions, but there is a big restriction:
Our binary representation contains direct pointers to the info tables of objects in the region. This means that the info tables of the receiving process must be laid out in exactly the same way as from the original process; in practice, this means using static linking, using the exact same binary and turning off ASLR. This API does NOT do any safety checking and will probably segfault if you get it wrong. DO NOT run this on untrusted input.
This also gives insight into why universal serialization is not currently possible. Data structures contain very specific pointers, which can differ if you're using different binaries. Reading the raw bytes into another binary will result in invalid pointers.
There is some discussion in this GitHub issue about weakening this requirement.
I think the proper way is to open an issue or pull request upstream to export the data constructors in the internal module. That is what happened with HashMap which is now fully accessible in its internal module.
Update: it seems there is already a similar open issue about this.

Writing to disk parameters of Bellman in rust

So I just started using Rust, and started using the bellman crate.
I used the MiMC example from the bellman repository, and it seems to calculate the parameters for the circuit each time you run the example. I want to use the example as a base for my code, and it seems redundant to recalculate them each time for the same circuit, so I wanted to write the params to disk and check on each run whether they already exist for a specific circuit (so that if they were already calculated, the program reads them instead of recalculating them).
Assuming params is a structure, I tried using serde and serde_json, but I keep getting the following error:
^^^^^^^ the trait serde::ser::Serialize is not implemented for bellman::groth16::Parameters<pairing::bls12_381::Bls12>
Any thoughts on how I can write it and read it later efficiently?
thanks!

serde has Serialize/Deserialize traits, which should be derived or implemented in the crate where the types are defined. So it's usually a good idea to check Cargo.toml (or the documentation) for a serde feature; providing one is a pretty common practice (and sometimes you need to enable it manually). For the bellman crate, however, that doesn't seem to be implemented, so you need a workaround for the "external" type (explanation). Serde has fairly good support for that; take a look at their docs. In short, you need to provide a newtype that mimics the original one to #[serde(with = "<here-your-newtype>")].
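Independently of how the bytes are produced, the "read it if it exists, otherwise compute and write it" flow from the question can be sketched with the standard library alone. Everything here is hypothetical: Params is a stand-in for the real parameter type, and to_bytes/from_bytes stand in for whatever encoding ends up being used (serde with a mirror type, or a byte-level write/read API if the crate offers one, which should be checked in its docs).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical stand-in for an expensive-to-compute parameter set.
#[derive(Debug, Clone, PartialEq)]
pub struct Params {
    pub values: Vec<u64>,
}

impl Params {
    /// Encode as a flat sequence of little-endian u64 values.
    pub fn to_bytes(&self) -> Vec<u8> {
        self.values.iter().flat_map(|v| v.to_le_bytes()).collect()
    }

    /// Decode the format written by `to_bytes`; reject truncated input.
    pub fn from_bytes(bytes: &[u8]) -> io::Result<Self> {
        if bytes.len() % 8 != 0 {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "truncated params file",
            ));
        }
        let values = bytes
            .chunks_exact(8)
            .map(|chunk| {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(chunk);
                u64::from_le_bytes(buf)
            })
            .collect();
        Ok(Params { values })
    }
}

/// Load parameters from `path` if the file exists; otherwise run `compute`,
/// write the result to `path`, and return it.
pub fn load_or_compute<F>(path: &Path, compute: F) -> io::Result<Params>
where
    F: FnOnce() -> Params,
{
    if path.exists() {
        return Params::from_bytes(&fs::read(path)?);
    }
    let params = compute();
    fs::write(path, params.to_bytes())?;
    Ok(params)
}
```

Using one file path per circuit keeps parameters for different circuits from being mixed up; the sketch does not guard against two processes writing the same file at once.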

Deserialize file using serde_json at compile time

At the beginning of my program, I read data from a file:
let file = std::fs::File::open("data/games.json").unwrap();
let data: Games = serde_json::from_reader(file).unwrap();
I would like to know how it would be possible to do this at compile time for the following reasons:
Performance: no need to deserialize at runtime
Portability: the program can be run on any machine without the need to have the json file containing the data with it.
It might also be useful to mention that the data is read-only, which means the solution can store it as a static.

This is straightforward, but leads to some potential issues. First, we need to decide something: do we want to build the tree of objects at compile time, or embed the file and parse it at runtime?
99% of the time, parsing on boot into a static ref is enough for people, so I'm going to give you that solution; I will point you to the "other" version at the end, but that requires a lot more work and is domain-specific.
The macro (because it has to be a macro) you are looking for to be able to include a file at compile-time is in the standard library: std::include_str!. As the name suggests, it takes your file at compile-time and generates a &'static str from it for you to use. You are then free to do whatever you like with it (such as parsing it).
From there, it is a simple matter to then use lazy_static! to generate a static ref to our JSON Value (or whatever it may be that you decide to go for) for every part of the program to use. In your case, for instance, it could look like this:
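use lazy_static::lazy_static;
use serde::{Deserialize, Serialize};

// Embed the file contents in the binary at compile time.
const GAME_JSON: &str = include_str!("my/file.json");

#[derive(Serialize, Deserialize, Debug)]
struct Game {
    name: String,
}

lazy_static! {
    // Parsed once, on first access.
    static ref GAMES: Vec<Game> = serde_json::from_str(GAME_JSON).unwrap();
}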
You need to be aware of two things when doing this:
This will increase your binary size by the full size of the file, as the &str isn't compressed in any way. Consider gzip.
You'll need to keep in mind the usual concerns around access to the same static ref from multiple threads, but since it isn't mutable you only really need to worry about a portion of them.
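As a variant of the lazy_static approach above, the same "embed, then parse once on first access" pattern can be written with std::sync::OnceLock from the standard library (Rust 1.70 or later). This is only a sketch under stated assumptions: the GAME_DATA constant stands in for include_str!, the one-name-per-line format replaces JSON so that no external crate is needed, and parse_games and games are hypothetical names.

```rust
use std::sync::OnceLock;

/// Stand-in for `include_str!("my/file.json")`, inlined so the sketch compiles
/// without the file; the format (one game name per line) is an assumption.
const GAME_DATA: &str = "chess\ngo\ncheckers\n";

#[derive(Debug, PartialEq)]
pub struct Game {
    pub name: String,
}

/// Parse the embedded text; a real program would call serde_json here.
fn parse_games(raw: &str) -> Vec<Game> {
    raw.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(|line| Game { name: line.to_string() })
        .collect()
}

/// Return the parsed list, parsing on the first call and caching afterwards.
pub fn games() -> &'static [Game] {
    static GAMES: OnceLock<Vec<Game>> = OnceLock::new();
    GAMES.get_or_init(|| parse_games(GAME_DATA))
}
```

Because the cached value is only ever handed out as a shared reference, concurrent readers need no extra locking; OnceLock makes sure the initializer runs once.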
The other way requires generating your objects at compile time using a procedural macro. As stated, I wouldn't recommend it unless you have a really expensive startup cost when parsing that JSON; most people will not, and the last time I had this was when dealing with deeply nested multi-GB JSON files.
The crates you want to look out for are proc_macro2 and syn for the code generation; the rest is very similar to how you would write a normal method.

When you are deserializing something at runtime, you're essentially building some representation in program memory from another representation on disk. But at compile time, there's no notion of "program memory" yet - where will this data deserialize to?
However, what you're trying to achieve is, in fact, possible. The main idea is as follows: to create something in program memory, you must write some code which will create the data. What if you could generate that code automatically, based on the serialized data? That's what the uneval crate does (disclaimer: I'm the author, so you're encouraged to look through the source to see if you can do better).
To use this approach, you'll have to create build.rs with approximately the following content:
// somehow include the Games struct with its Serialize and Deserialize implementations
fn main() {
    let games: Games = serde_json::from_str(include_str!("data/games.json")).unwrap();
    uneval::to_out_dir(games, "games.rs");
}
And in your initialization code you'll have the following:
let data: Games = include!(concat!(env!("OUT_DIR"), "/games.rs"));
Note, however, that this might be fairly hard to do in an ergonomic way, since the necessary struct definitions now must be shared between build.rs and the crate itself, as I mentioned in the comment. It might be a little easier if you split your crate in two, keeping the struct definitions (and only them) in one crate, and the logic which uses them in another one. There are some other ways - with include! trickery, or by using the fact that the build script is an ordinary Rust binary and can include other modules as well - but this will complicate things even more.
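To make the "generate code from data" idea concrete, here is a hand-rolled sketch of what such a generator does for one small type. It is not how uneval is implemented; the Game type and the generate_source function are hypothetical, and a build script would write the returned string to a file under OUT_DIR so that include! can pick it up.

```rust
/// Hypothetical record type shared by the build script and the crate.
pub struct Game {
    pub name: String,
}

/// Emit a Rust expression that rebuilds `games` when compiled.
/// `{:?}` on a string yields an escaped, double-quoted literal.
pub fn generate_source(games: &[Game]) -> String {
    let mut out = String::from("vec![");
    for game in games {
        out.push_str(&format!("Game {{ name: {:?}.to_string() }}, ", game.name));
    }
    out.push(']');
    out
}
```

The generated text is an ordinary expression of type Vec<Game>, so the including crate pays no parsing cost at startup, only the cost of running the constructors.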

How do I use nested Vecs with wasm-bindgen?

It doesn't appear that nested Vecs work with wasm-bindgen. Is that correct?
My goal is to have a Game of Life grid in Rust that I can return as rows, rather than a 1D Vec which requires the JavaScript to handle the indexing. Two workarounds I've thought of are:
Implement a sort of custom "iterator" in Rust, which is a method which returns the rows one-by-one.
Hand a 1D array to JavaScript but write a wrapper in JavaScript which handles the indexing and exposes some sort of an iterator to the consumer.
I hesitate to use either of these because I want this library to be usable by JavaScript and native Rust, and I don't think either would be very idiomatic in pure Rust land. Any other suggestions?

You're correct that wasm-bindgen today doesn't support returning types like Vec<Vec<u8>>.
A good rule of thumb for WebAssembly is that big chunks of data (like vectors) should always live in the same location to avoid losing too much performance. This means that you might want to explore an interface where a JS object wraps a pointer into WASM memory, and all of its methods work with row/column indices but modify WASM memory to keep it as the source of truth.
If that doesn't work out, then the best way to implement this today is either of the strategies you mentioned, although both of those require some level of JS glue code to be written as well.
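One possible shape for the flat-buffer design is sketched below in plain Rust. The Grid type and its methods are hypothetical, and the #[wasm_bindgen] attributes are omitted so the sketch builds with the standard library only. The idea is that get and set work on row/column indices and could be exported to JavaScript, while rows gives native Rust callers a row-by-row view without copying.

```rust
/// Game of Life grid stored as one flat, row-major buffer.
pub struct Grid {
    width: usize,
    height: usize,
    cells: Vec<u8>,
}

impl Grid {
    /// Create a grid with every cell set to 0 (dead).
    pub fn new(width: usize, height: usize) -> Self {
        Grid { width, height, cells: vec![0; width * height] }
    }

    /// Flat index of (row, col); panics if the cell is out of bounds.
    fn index(&self, row: usize, col: usize) -> usize {
        assert!(row < self.height && col < self.width, "cell out of bounds");
        row * self.width + col
    }

    /// Read one cell by row and column.
    pub fn get(&self, row: usize, col: usize) -> u8 {
        self.cells[self.index(row, col)]
    }

    /// Write one cell by row and column.
    pub fn set(&mut self, row: usize, col: usize, value: u8) {
        let i = self.index(row, col);
        self.cells[i] = value;
    }

    /// Native-Rust view: iterate over rows as slices without copying.
    pub fn rows(&self) -> std::slice::Chunks<'_, u8> {
        // `max(1)` keeps `chunks` from panicking on a zero-width grid.
        self.cells.chunks(self.width.max(1))
    }
}
```

Keeping the buffer flat means the data stays in one place in WASM memory, and only the indexing logic differs between the JavaScript-facing and the Rust-facing API.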

Custom datatype (MPI_Datatype datatype)?

Is there such a thing as a custom datatype in MPI, or do you have to flatten everything into a text string and pass as MPI_CHAR? If you are required to flatten everything, is there a built-in function I am overlooking?

The answer is MPI_Type_contiguous (the link is to the documentation). It lets you define a new datatype as a fixed number of contiguous elements of an existing basic datatype.

A much better answer is MPI_Type_create_struct. It allows you to replicate your struct datatype and pass it around; there is a great example of its use at DeinoMPI.
