Polars: how to create a dataframe from arrow RecordBatch? - rust

I am using the Rust arrow library and have a RecordBatch struct. I want to create a Polars DataFrame out of it, do some operations in Polars land, and move the result back to a RecordBatch. Since both are Arrow based, I suppose there might be an efficient way to convert back and forth.
What is the best way to convert a RecordBatch to a Polars DataFrame and vice versa in Rust?
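One approach (a sketch, not a confirmed API): Polars ships its own Arrow implementation internally, so arrow-rs types don't plug in directly, and a simple if non-zero-copy option is to round-trip through the Arrow IPC stream format. This assumes the arrow crate, polars built with the ipc_streaming feature, and anyhow for error handling; the helper names below are mine, and exact reader/writer names can vary between versions.

// Round trip between an arrow-rs RecordBatch and a Polars DataFrame via the
// Arrow IPC stream format. Not zero-copy: the data is serialized and re-read.
use std::io::Cursor;

use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;
use polars::prelude::*;

// Hypothetical helper: RecordBatch -> DataFrame.
fn record_batch_to_df(batch: &RecordBatch) -> anyhow::Result<DataFrame> {
    let mut buf = Vec::new();
    {
        // Write the batch as an IPC stream with arrow...
        let mut writer = StreamWriter::try_new(&mut buf, &batch.schema())?;
        writer.write(batch)?;
        writer.finish()?;
    }
    // ...and read the stream back with polars.
    Ok(IpcStreamReader::new(Cursor::new(buf)).finish()?)
}

// Hypothetical helper: DataFrame -> RecordBatches (one per IPC message).
fn df_to_record_batches(df: &mut DataFrame) -> anyhow::Result<Vec<RecordBatch>> {
    let mut buf = Vec::new();
    // Write the DataFrame as an IPC stream with polars...
    IpcStreamWriter::new(&mut buf).finish(df)?;
    // ...and read the stream back with arrow.
    let reader = StreamReader::try_new(Cursor::new(buf), None)?;
    Ok(reader.collect::<Result<Vec<_>, _>>()?)
}

A zero-copy path exists via the Arrow C Data Interface (FFI), but it is considerably more involved; if the copy above is acceptable for your batch sizes, the IPC round trip is the simplest thing that works.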

Related

Creating Datafusion's Dataframe from Vec<Struct> in Rust?

I am trying to do something similar to the question here, but instead of using the Polars library I would like to use the DataFusion library.
The idea is to go from a vec of structs like this:
#[derive(Serialize)]
struct Test {
    id: u32,
    amount: u32,
}
and save to Parquet files, just like in the question I referenced.
While it was possible with Polars, as seen in the accepted answer, to achieve this by taking the structs, serialising them to JSON, and then building the DataFrame from that, I could not find a similar approach using DataFusion.
Any suggestions will be appreciated.
I think parquet_derive is designed exactly for the use case of writing Rust structs to/from Parquet files. DataFusion would be useful if you wanted to process the resulting data, for example filtering or aggregating it with SQL.
Here is an example in the docs: https://docs.rs/parquet_derive/30.0.1/parquet_derive/derive.ParquetRecordWriter.html
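For reference, a rough sketch of what that can look like, assuming the parquet and parquet_derive crates (the exact writer API and supported field types vary a little between versions, so treat this as an outline of the linked docs rather than a drop-in snippet):

// Sketch: writing a Vec<Test> to a Parquet file with parquet_derive.
use std::fs::File;
use std::sync::Arc;

use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;
use parquet::record::RecordWriter; // brings schema() and write_to_row_group() into scope
use parquet_derive::ParquetRecordWriter;

#[derive(ParquetRecordWriter)]
struct Test {
    id: u32,
    amount: u32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rows = vec![Test { id: 1, amount: 10 }, Test { id: 2, amount: 20 }];

    // The derived RecordWriter impl can describe its own Parquet schema.
    let schema = rows.as_slice().schema()?;
    let props = Arc::new(WriterProperties::builder().build());

    let file = File::create("test.parquet")?;
    let mut writer = SerializedFileWriter::new(file, schema, props)?;

    // Write all rows into a single row group, then close the file.
    let mut row_group = writer.next_row_group()?;
    rows.as_slice().write_to_row_group(&mut row_group)?;
    row_group.close()?;
    writer.close()?;
    Ok(())
}

If you later want to filter or aggregate the data, DataFusion can read the resulting Parquet file back with its normal Parquet support.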

fastest way to serialize numpy ndarray?

I have a constant flow of medium-sized ndarrays (each around 10-15 MB in memory) on which I use ndarray.tobytes() before I send them to the next part of the pipeline.
Currently it takes about 70-100ms per array serialization.
I was wondering, is this the fastest that this could be done or is there a faster (maybe not as pretty) way to accomplish that?
Clarification: the arrays are images, the next step in the pipeline is a C++ function, and I don't want to save them to files.
There is no need to serialize them at all! You can let C++ read the memory directly. One way is to invoke a C++ function with the PyObject which is your NumPy array. Another is to let C++ allocate the NumPy array in the first place and populate the elements in Python before returning control to C++, for which I have some open source code built atop Boost Python that you can use: https://github.com/jzwinck/pccl/blob/master/NumPyArray.hpp
Your goal should be "zero copy", meaning you never copy the bytes of the array; you only copy references to the array (or to data within it) plus the dimensions.

If dataframes in Spark are immutable, why are we able to modify it with operations such as withColumn()?

This is probably a stupid question originating from my ignorance. I have been working on PySpark for a few weeks now and do not have much programming experience to start with.
My understanding is that in Spark, RDDs, DataFrames, and Datasets are all immutable - which, again as I understand it, means you cannot change the data. If so, why are we able to edit a DataFrame's existing column using withColumn()?
As per the Spark architecture, a DataFrame is built on top of RDDs, which are immutable in nature; hence DataFrames are immutable as well.
Regarding withColumn, or any other operation for that matter: when you apply such operations to a DataFrame, it generates a new DataFrame instead of updating the existing one.
However, when you are working with Python, which is a dynamically typed language, you simply overwrite the previous reference. Hence, when you execute the statement below
df = df.withColumn()
it will generate another DataFrame and assign it to the reference "df".
To verify this, you can use the id() method of the underlying RDD to get the unique identifier of your DataFrame:
df.rdd.id()
will give you the unique identifier for your DataFrame.
I hope the above explanation helps.
You aren't; the documentation explicitly says:
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
If you keep a variable referring to the dataframe you called withColumn on, it won't have the new column.
The core data structure of Spark, i.e. the RDD itself, is immutable. This is much like a String in Java, which is immutable as well.
When you concatenate a string with another literal you are not modifying the original string; you are actually creating a new one altogether.
Similarly, whether it is a DataFrame or a Dataset, whenever you alter the underlying RDD by adding a column or dropping one, you are not changing anything in it; instead you are creating a new Dataset/DataFrame.

How do I use nested Vecs with wasm-bindgen?

It doesn't appear that nested Vecs work with wasm-bindgen. Is that correct?
My goal is to have a Game of Life grid in Rust that I can return as rows, rather than a 1D Vec which requires the JavaScript to handle the indexing. Two workarounds I've thought of are:
Implement a sort of custom "iterator" in Rust: a method that returns the rows one by one.
Hand a 1D array to JavaScript but write a wrapper in JavaScript which handles the indexing and exposes some sort of an iterator to the consumer.
I hesitate to use either of these because I want this library to be usable by JavaScript and native Rust, and I don't think either would be very idiomatic in pure Rust land. Any other suggestions?
You're correct that wasm-bindgen today doesn't support returning types like Vec<Vec<u8>>.
A good rule of thumb for WebAssembly is that big chunks of data (like vectors) should always live in the same location to avoid losing too much performance. This means that you might want to explore an interface where a JS object wraps a pointer into WASM memory, and all of its methods work with row/column indices but modify WASM memory to keep it as the source of truth.
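For instance, something along these lines (type and method names here are just illustrative, not a prescribed API): a wasm-bindgen struct owns the flat, row-major buffer and exposes width/height plus row/column accessors, so JS never has to do the index arithmetic, while native Rust code can use the same type directly.

// Sketch of a wasm-bindgen wrapper that keeps the grid in WASM memory
// and exposes row/column access to JS.
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct Grid {
    width: usize,
    height: usize,
    cells: Vec<u8>, // flat row-major storage stays on the Rust side
}

#[wasm_bindgen]
impl Grid {
    #[wasm_bindgen(constructor)]
    pub fn new(width: usize, height: usize) -> Grid {
        Grid { width, height, cells: vec![0; width * height] }
    }

    pub fn width(&self) -> usize { self.width }
    pub fn height(&self) -> usize { self.height }

    // Row/column accessors so JS never has to do the 1D index math itself.
    pub fn get(&self, row: usize, col: usize) -> u8 {
        self.cells[row * self.width + col]
    }

    pub fn set(&mut self, row: usize, col: usize, alive: u8) {
        self.cells[row * self.width + col] = alive;
    }

    // Optionally expose the raw pointer so JS can build a typed-array view
    // over WASM linear memory without copying.
    pub fn cells_ptr(&self) -> *const u8 {
        self.cells.as_ptr()
    }
}

On the JS side you can either call get/set through the generated bindings, or build a Uint8Array view over WASM memory from cells_ptr() if you need bulk access without copying.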
If that doesn't work out, then the best way to implement this today is either of the strategies you mentioned, although both of those require some JS glue code to be written as well.

How to use a WholeRowIterator as the source of another iterator?

I am trying to filter out columns after using a WholeRowIterator to filter rows. This is to remove columns that were useful in determining which row to keep, but not useful in the data returned by the scan.
The WholeRowIterator does not appear to play nicely as the source of another iterator such as a RegExFilter. I know the keys/values are encoded by the WholeRowIterator.
Are there any possible solutions to get this iterator stack to work?
Thanks.
Usually, the WholeRowIterator is the last iterator in the "stack", as it involves serializing the row (many key-values) into a single key-value, and you probably don't want to do that more than once. But let's assume you do:
You would want to write an iterator which deserializes each key-value back into a SortedMap using the WholeRowIterator's method, modifies the SortedMap, reserializes it into a single key-value, and then returns it. This iterator would need to be assigned a priority higher than the priority given to the WholeRowIterator.
Alternatively, you could extend the WholeRowIterator and override the encodeRow(List<Key>,List<Value>) method to not serialize your unwanted columns in the first place. This would save the extra serialization and deserialization the first approach has.
