I would like to create a large Polars DataFrame using Rust, building it up row by row using data scraped from web pages. What is an efficient way to do this?
It looks like the DataFrame should be created from a Vec of Series rather than adding rows to an empty DataFrame. However, how should a Series be built up efficiently? I could create a Vec and then create a Series from the Vec, but that sounds like it will end up copying all elements. Is there a way to build up a Series element-by-element, and then build a DataFrame from those?
I will actually be building up several DataFrames in parallel using Rayon, then combining them, but it looks like vstack does what I want there. It's the creation of the individual DataFrames that I can't find out how to do efficiently.
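For the combining step I'm imagining something like this sketch (assuming a polars version that exposes vstack and the PolarsResult alias; error handling simplified):

use polars::prelude::*;

// Sketch: combine per-thread DataFrames by row-stacking them.
// Assumes the schemas match and `dfs` is non-empty.
fn combine(mut dfs: Vec<DataFrame>) -> PolarsResult<DataFrame> {
    let mut out = dfs.remove(0);
    for df in dfs {
        out = out.vstack(&df)?;
    }
    Ok(out)
}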
I did look at the source of the CSV parser, but that is very complicated and probably highly optimised. Is there a simpler approach that is still reasonably efficient?
pub fn from_vec(
    name: &str,
    v: Vec<<T as PolarsNumericType>::Native>
) -> ChunkedArray<T>
Create a new ChunkedArray by taking ownership of the Vec. This operation is zero copy.
Here is the link. You can then call into_series on it.
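A minimal sketch of that approach, with made-up column names (the exact DataFrame::new signature varies a little between polars versions):

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // Plain Vecs built up element by element while scraping.
    let ids: Vec<u32> = vec![1, 2, 3];
    let scores: Vec<f64> = vec![0.5, 0.9, 0.1];

    // from_vec takes ownership of each Vec, so this is zero copy.
    let id_col = UInt32Chunked::from_vec("id", ids).into_series();
    let score_col = Float64Chunked::from_vec("score", scores).into_series();

    let df = DataFrame::new(vec![id_col, score_col])?;
    println!("{df}");
    Ok(())
}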
The simplest, if perhaps not the most performant, answer is to just maintain a map of vectors and turn them into the series that get fed to a DataFrame all at once.
let mut columns = BTreeMap::new();
for datum in get_data_from_web() {
    // For simplicity suppose datum is itself a BTreeMap
    // (More likely it's a serde_json::Value)
    // It's assumed that every datum has the same keys; if not, the
    // Vecs won't have the same length
    // It's also assumed that the values of datum are all of the same known type
    for (k, v) in datum {
        columns.entry(k).or_insert_with(Vec::new).push(v);
    }
}
let df = DataFrame::new(
    columns.into_iter()
        .map(|(name, values)| Series::new(&name, values))
        .collect::<Vec<_>>()
).unwrap();
While iterating over lines in a file I need to first do "task_A" and then "task_B". In the first few lines there is some data that I need to put into a data structure (task_A), and after that the lines describe how the data inside that structure is manipulated (task_B). Right now I use a for loop with enumerate and if/else statements that switch depending on the line number:
let file = File::open("./example.txt").unwrap();
let reader = BufReader::new(file);
for (i, lines) in reader.lines().map(|l| l.unwrap()).enumerate() {
if i < n {
do_task_a(&lines);
} else {
do_task_b(&lines);
}
}
There is also the take_while() method for iterators, but that only solves one part. Ideally I would pass the iterator to one function for n steps and after that to another function, and I want a solution that iterates over the file only once.
(For anyone wondering: I want a more elegant solution for day 5 of Advent of Code 2022.) Is there a way to do that, i.e. to "re-use" the iterator once it has already been advanced n steps?
Looping or using an iterator adapter will consume an iterator. But if I is an iterator then so is &mut I!
You can use that instance to partially iterate through the iterator with one adapter and then continue with another. The first use consumes only the mutable reference, but not the iterator itself. For example using take:
let mut it = reader.lines().map(|l| l.unwrap());
for lines in (&mut it).take(n) {
do_task_a(&lines);
}
for lines in it {
do_task_b(&lines);
}
But I think your original code is still completely fine.
I need to write my own expression in polars_lazy. Based on my reading of the source code, I need to write a function that returns Expr::Function. The problem is that constructing a value of this variant requires a FunctionOptions, and while that struct is public, its members are pub(crate), so such an object cannot be constructed outside the crate.
Are there ways around this?
I don't think you're meant to directly construct Exprs. Instead, you can use functions like polars_lazy::dsl::col() and polars_lazy::dsl::lit() to create expressions, then use methods on Expr to build up the expression. Several of those methods, such as map() and apply(), will give you an Expr::Function.
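For example, a small sketch (column name made up) that builds an expression from the dsl helpers instead of constructing Expr variants by hand:

use polars::prelude::*;

// col() and lit() create leaf expressions; methods on Expr combine them.
fn expr_example() -> Expr {
    (col("a") + lit(1)).alias("a_plus_one")
}

You would then pass such an expression to with_column() or select() on a LazyFrame.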
Personally I think the Rust API for polars is not well documented enough to really use yet. Although the other answer and comments mention apply and map, they don't mention how or the trade-offs. I hope this answer prompts others to correct me with the "right" way to do things.
So first, here's how to use apply on a lazy DataFrame (lazy DataFrames don't take apply directly as a method the way eager ones do), mutating the column in place:
// not sure how you'd find this type easily from apply documentation
let o = GetOutput::from_type(DataType::UInt32);
// this mutates two in place
let lf = lf.with_column(col("two").apply(str_to_len, o));
And here's how to use it while not mutating the source column and adding a new output column instead:
let o = GetOutput::from_type(DataType::UInt32);
// this adds new column len, two is unchanged
let lf = lf.with_column(col("two").alias("len").apply(str_to_len, o));
With the str_to_len looking like:
fn str_to_len(str_val: Series) -> Result<Series> {
let x = str_val
.utf8()
.unwrap()
.into_iter()
// your actual custom function would be in this map
.map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
.collect::<UInt32Chunked>();
Ok(x.into_series())
}
Note that it takes Series rather than &Series and wraps in Result.
With a regular (non-lazy) dataframe, apply still mutates but doesn't require with_column:
df.apply("two", str_to_len).expect("applied");
Whereas eager/non-lazy's with_column doesn't require apply:
// the fn we use to make the column also names it (via rename, below)
df.with_column(str_to_len(df.column("two").expect("has two"))).expect("with_column");
And str_to_len has slightly different signature:
fn str_to_len(str_val: &Series) -> Series {
let mut x = str_val
.utf8()
.unwrap()
.into_iter()
.map(|opt_name: Option<&str>| opt_name.map(|name: &str| name.len() as u32))
.collect::<UInt32Chunked>();
// NB. this is naming the chunked array, before we even get to a series
x.rename("len");
x.into_series()
}
I know there's reasons to have lazy and eager operate differently, but I wish the Rust documentation made this easier to figure out.
This is similar to How do I use a custom comparator function with BTreeSet? however in my case I won't know the sorting criteria until runtime. The possible criteria are extensive and can't be hard-coded (think something like sort by distance to target or sort by specific bytes in a payload or combination thereof). The sorting criteria won't change after the map/set is created.
The only alternatives I see are:
- use a Vec, but log(n) inserts and deletes are crucial
- wrap each of the elements with the sorting criteria (directly or indirectly), but that seems wasteful (see the sketch further below)
This is possible with standard C++ containers std::map/std::set but doesn't seem possible with Rust's BTreeMap/BTreeSet. Is there an alternative in the standard library or in another crate that can do this? Or will I have to implement this myself?
My use-case is a database-like system where elements in the set are defined by a schema, like:
Element {
FIELD x: f32
FIELD y: f32
FIELD z: i64
ORDERBY z
}
But since the schema is user-defined at runtime, the elements are stored in a set of bytes (BTreeSet<Vec<u8>>). Likewise the order of the elements is user-defined. So the comparator I would give to BTreeSet would look like |a, b| schema.cmp(a, b). Hard-coded, the above example may look something like:
fn cmp(&self, a: &[u8], b: &[u8]) -> Ordering {
    let a_field = self.get_field(a, 2).as_i64();
    let b_field = self.get_field(b, 2).as_i64();
    a_field.cmp(&b_field)
}
Would it be possible to pass the comparator closure as an argument to each node operation that needs it? It would be owned by the tree wrapper instead of cloned in every node.
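For reference, here is a minimal sketch of the "wrap each of the elements (indirectly)" alternative from the list above: each element carries a shared Rc handle to the runtime schema, so the per-element overhead is one pointer. Schema and Row are hypothetical names, and the byte comparison stands in for the real field-aware logic:

use std::cmp::Ordering;
use std::collections::BTreeSet;
use std::rc::Rc;

// Hypothetical runtime schema; cmp compares two encoded rows.
struct Schema;
impl Schema {
    fn cmp(&self, a: &[u8], b: &[u8]) -> Ordering {
        a.cmp(b) // stand-in for the field-aware comparison
    }
}

// Each element holds a shared handle to the schema it sorts by.
struct Row {
    bytes: Vec<u8>,
    schema: Rc<Schema>,
}

impl PartialEq for Row {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Row {}
impl PartialOrd for Row {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Row {
    fn cmp(&self, other: &Self) -> Ordering {
        self.schema.cmp(&self.bytes, &other.bytes)
    }
}

fn main() {
    let schema = Rc::new(Schema);
    let mut set = BTreeSet::new();
    set.insert(Row { bytes: vec![2], schema: Rc::clone(&schema) });
    set.insert(Row { bytes: vec![1], schema: Rc::clone(&schema) });
    assert_eq!(set.len(), 2);
}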
I need to create a large HashMap in Rust, which is why I thought of using Box to put it in heap memory.
My question is about the best way to store this data; I have only come up with two possibilities (anticipating that I am not so experienced with Rust).
fn main() {
    let hashmap: Box<HashMap<u64, DataStruct>> = Box::new(HashMap::new());
    // ...
}
OR
fn main() {
    let hashmap: HashMap<u64, Box<DataStruct>> = HashMap::new();
    // ...
}
What is the best way to handle such a thing?
Thanks a lot.
HashMap already stores its data on the heap, you don't need to box your values.
Just like vectors, hash maps store their data on the heap. This
HashMap has keys of type String and values of type i32. Like vectors,
hash maps are homogeneous: all of the keys must have the same type,
and all of the values must have the same type.
Source
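To illustrate (DataStruct here is a hypothetical placeholder):

use std::collections::HashMap;

struct DataStruct {
    value: i32,
}

fn main() {
    // No Box needed: HashMap keeps its entries in a heap allocation already.
    let mut map: HashMap<u64, DataStruct> = HashMap::new();
    map.insert(42, DataStruct { value: 7 });
    assert_eq!(map.get(&42).map(|d| d.value), Some(7));
}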
I have a sorted v: Vec<EventHandler<T>> and I want to insert an element into it while keeping it sorted. What's the most efficient way to do so? Rust doesn't seem to have a built-in way to do it.
EventHandler<T> is as follows:
struct EventHandler<T: Event + ?Sized> {
priority: i32,
f: fn(&mut T),
}
Because of how sorting works, pushing the element and then re-sorting the whole vector would be inefficient: O(n log n) time per insertion, plus the scratch allocation the stable sort performs.
The task consists of two steps: finding the insert-position with binary_search and inserting with Vec::insert():
match v.binary_search(&new_elem) {
    Ok(pos) => {} // element already in vector @ `pos`
    Err(pos) => v.insert(pos, new_elem),
}
If you want to allow duplicate elements in your vector and thus want to insert already existing elements, you can write it even shorter:
let pos = v.binary_search(&new_elem).unwrap_or_else(|e| e);
v.insert(pos, new_elem);
But: be aware that this has a runtime complexity of O(n). To insert into the middle, the vector has to move every element right of your insert-position one to the right.
So you shouldn't use it to insert more than a few elements into a vector that isn't tiny. In particular, you shouldn't use this method to sort a vector, as that insertion sort runs in O(n²).
A BinaryHeap might be a better choice in such a situation. Each insert (push) has a runtime complexity of just O(log n) instead of O(n). You can even convert it into a sorted Vec with into_sorted_vec(), if you so desire. You can also continue to use the heap instead of converting it.
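A minimal sketch of the heap approach, using plain integers in place of EventHandler (a real EventHandler would need an Ord implementation, e.g. keyed on priority):

use std::collections::BinaryHeap;

fn main() {
    let mut heap = BinaryHeap::new();
    // Each push is O(log n) instead of Vec::insert's O(n).
    for priority in [3, 1, 2] {
        heap.push(priority);
    }
    // into_sorted_vec() returns the elements in ascending order.
    assert_eq!(heap.into_sorted_vec(), vec![1, 2, 3]);
}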