How to do simple computations on a Polars DataFrame - Rust

Hello, I want to do simple computations on the columns of a Polars DataFrame, but I have no clue how the API works and the documentation is not helping. How can I compute the mean of A_col, for example?
fn main() {
    let df: DataFrame = example().unwrap();
    // column() returns a PolarsResult<&Series>
    let A_col: &Series = df.column("A").unwrap();
}
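For the record, a minimal sketch of one way to do this (the df! block stands in for the question's example() helper; Series::mean returns an Option<f64> because the column may be empty or all-null):

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // stand-in for the question's example() helper
    let df = df! [
        "A" => [1.0f64, 2.0, 3.0]
    ]?;
    // mean() returns Option<f64>: None for an empty or all-null column
    let mean_a = df.column("A")?.mean();
    println!("{:?}", mean_a); // Some(2.0)
    Ok(())
}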

Related

Converting a Utf8 Series into a Series of List<Utf8> via a custom function in Rust polars

I have a Utf8 column in my DataFrame, and from it I want to create a column of List<Utf8>.
In particular, for each row I take the text of an HTML document and use soup to parse out all the <p> paragraph tags, storing the text of each separate paragraph as a Vec<String> or Vec<&str>. I have this as a standalone function:
fn parse_paragraph(s: &str) -> Vec<String> {
    let soup = Soup::new(s);
    // text() yields owned Strings, one per <p> tag
    soup.tag("p").find_all().map(|p| p.text()).collect()
}
In trying to adapt the few available examples of applying custom functions in Rust polars, I can't seem to get the conversion to compile.
Take this MVP example, using a simpler string-to-vec-of-strings example, borrowing from the Iterators example from the documentation:
use polars::prelude::*;

fn vector_split(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}

fn vector_split_series(s: &Series) -> PolarsResult<Series> {
    let output: Series = s.utf8()
        .expect("Text data")
        .into_iter()
        .map(|t| t.map(vector_split))
        .collect();
    Ok(output)
}
fn main() {
    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();
    df.clone().lazy()
        .select([
            col("text").apply(|s| vector_split_series(&s), GetOutput::default())
                .alias("words")
        ])
        .collect();
}
(Note: I know there is an in-built split function for utf8 Series, but I needed a simpler example than parsing HTML)
I get the following error from cargo check:
error[E0277]: a value of type `polars::prelude::Series` cannot be built from an iterator over elements of type `Option<Vec<&str>>`
--> src/main.rs:11:27
|
11 | let output : Series = s.utf8()
| ___________________________^
12 | | .expect("Text data")
13 | | .into_iter()
14 | | .map(|t| t.map(vector_split))
| |_____________________________________^ value of type `polars::prelude::Series` cannot be built from `std::iter::Iterator<Item=Option<Vec<&str>>>`
15 | .collect();
| ------- required by a bound introduced by this call
|
= help: the trait `FromIterator<Option<Vec<&str>>>` is not implemented for `polars::prelude::Series`
= help: the following other types implement trait `FromIterator<A>`:
<polars::prelude::Series as FromIterator<&'a bool>>
<polars::prelude::Series as FromIterator<&'a f32>>
<polars::prelude::Series as FromIterator<&'a f64>>
<polars::prelude::Series as FromIterator<&'a i32>>
<polars::prelude::Series as FromIterator<&'a i64>>
<polars::prelude::Series as FromIterator<&'a str>>
<polars::prelude::Series as FromIterator<&'a u32>>
<polars::prelude::Series as FromIterator<&'a u64>>
and 15 others
note: required by a bound in `std::iter::Iterator::collect`
What is the correct idiom for this kind of procedure? Is there a simpler way to apply a function?
For future seekers, I will explain the general solution and then the specific code to make the example work. I'll also point out some gotchas for this specific example.
Explanation
If you need a custom function instead of the convenient built-in Expr expressions, at the core of it you'll need a function that converts the Series of the input column into a Series backed by a ChunkedArray of the correct output type. This function is what you pass to apply (or map) in the select statement in main. The type of the ChunkedArray is the type you provide as GetOutput.
The code inside vector_split_series in the question works for conversion functions whose outputs are standard numeric types, or Lists of numeric types. It does not work automatically for Lists of Utf8 strings, because those are treated specially in ChunkedArrays for performance reasons: you have to build the Series explicitly, via the correct type builder.
In the question's case, that builder is ListUtf8ChunkedBuilder, which creates a ChunkedArray of List<Utf8>.
Correct code
The correct code for the question's example looks like this:
use polars::prelude::*;

fn vector_split(text: &str) -> Vec<String> {
    text.split(' ').map(|x| x.to_owned()).collect()
}

fn vector_split_series(s: Series) -> PolarsResult<Series> {
    let ca = s.utf8()?;
    let mut builder = ListUtf8ChunkedBuilder::new("words", s.len(), ca.get_values_size());
    ca.into_iter().for_each(|opt_s| match opt_s {
        None => builder.append_null(),
        Some(s) => builder.append_series(&Series::new("words", vector_split(s))),
    });
    Ok(builder.finish().into_series())
}

fn main() {
    let df = df! [
        "text" => ["a cat on the mat", "a bat on the hat", "a gnat on the rat"]
    ].unwrap();
    let df2 = df.clone().lazy()
        .select([
            col("text")
                .apply(|s| vector_split_series(s), GetOutput::from_type(DataType::List(Box::new(DataType::Utf8))))
                // GetOutput::default() also works when the compiler can determine the type
                //.apply(|s| vector_split_series(s), GetOutput::default())
                .alias("words")
        ])
        .collect()
        .unwrap();
    println!("{:?}", df2);
}
The core is vector_split_series. Its signature (an owned Series in, a PolarsResult<Series> out) is exactly what apply expects.
The match statement is required because a Series can contain null entries, and to preserve the length of the Series you need to pass the nulls through; the builder's append_null does exactly that.
For non-null entries the builder needs a Series to append. Normally you could append_from_iter, but there is (as of polars 0.26.1) no implementation of FromIterator for iterators over Vec<T>, so you convert each collection into a new Series and append that.
Once the list ChunkedArray is built, finish() it, convert it into a Series, and wrap it in Ok to return.
Gotcha
In the above example, vector_split can return Vec<String> or Vec<&str>, because split produces &str items that borrow directly from the input text.
If you are doing something more complicated --- like my original example of extracting text via Soup queries --- and it yields iterators of &str, those references may borrow from a temporary (here, the Soup value created inside the function), and you will run into errors about returning references to temporaries.
This is why the working code passes Vec<String> to the builder, even though it is not strictly required here.
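For completeness, a sketch of how the original HTML case plugs into the same builder pattern (assuming the soup API behaves as in parse_paragraph at the top of the question, returning owned Strings):

fn parse_paragraph_series(s: Series) -> PolarsResult<Series> {
    let ca = s.utf8()?;
    let mut builder = ListUtf8ChunkedBuilder::new("paragraphs", s.len(), ca.get_values_size());
    ca.into_iter().for_each(|opt_html| match opt_html {
        None => builder.append_null(),
        // owned Strings from parse_paragraph avoid borrowing from the temporary Soup
        Some(html) => builder.append_series(&Series::new("paragraphs", parse_paragraph(html))),
    });
    Ok(builder.finish().into_series())
}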

How can I create an array from a CSV column encoded as a string using Polars in Rust?

I'm trying to write a Rust program which uses Polars to read a CSV. This particular CSV encodes an array of floats as a string.
In my program, I want to load the CSV into a DataFrame and then parse this column into an array of floats. In Python you might write code that looks like this:
df = pd.read_csv("...")
df["babbage_search"] = df.babbage_search.apply(eval).apply(np.array)
However in my Rust program it's not clear how to go about this. I could take an approach like this:
let mut df = CsvReader::from_path("...")?
.has_header(true)
.finish()?;
df.apply("babbage_search", parse_vector)?;
However the parse_vector function is not really clear. I might write something like this, but this won't compile:
fn parse_vector(series: &Series) -> Series {
    series
        .utf8()
        .unwrap()
        .into_iter()
        .map(|opt_v| match opt_v {
            Some(v) => {
                let vec: Vec<f64> = serde_json::from_str(v).unwrap();
                let series: Series = vec.iter().collect();
                series.f64().unwrap().to_owned() as ChunkedArray<Float64Type>
            }
            None => ChunkedArray::<Float64Type>::default(),
        })
        .collect()
}
I'd appreciate any help in figuring this out.
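One possible shape for parse_vector, sketched under the assumption that each cell holds a JSON-style array literal: ListChunked implements FromIterator<Option<Series>>, so each parsed row can become a small Series, and nulls pass through unchanged.

use polars::prelude::*;

// Sketch: parse each cell's "[1.0, 2.0, ...]" string into a row of a List<Float64> column.
fn parse_vector(series: &Series) -> Series {
    let ca: ListChunked = series
        .utf8()
        .expect("expected a Utf8 column")
        .into_iter()
        .map(|opt_v| {
            opt_v.map(|v| {
                // assumes each cell is a valid JSON array of numbers
                let vals: Vec<f64> = serde_json::from_str(v).expect("malformed float array");
                Series::new("", vals)
            })
        })
        .collect();
    ca.into_series()
}

A list builder (as in the List<Utf8> answer above) would likely be faster for large frames, but this keeps the shape of the original attempt.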

iter map(v…) in rust - efficient recalculation of ndarray elements into two arrays at once?

I’d like to iterate over array elements and modify them. Additionally, for efficiency, I would like to compute two new results in a single pass. The pseudocode in Rust is below:
use ndarray::*; // definition of Array1
use ndarray_rand::{rand_distr::Uniform, RandomExt}; // needed for Array1::random

// let's assume exemplary input/output
let x = Array1::random(100, Uniform::<f64>::new(0., 1.));
let y = Array1::random(100, Uniform::<f64>::new(0., 1.));

// transformation of the x array into two arrays (a, b) of the same size
let (a, b): (Array1<f64>, Array1<f64>) = x
    .iter()
    .map(|v| {
        // ... computing
        let result1 = *v;      // just a dummy example
        let result2 = v * 2.0; // just a dummy example
        (result1, result2)
    })
    .collect(); // or unzip()

// after this I’d like to compute something on the results
let res1: f64 = a.dot(&y);
let res2: f64 = b.dot(&y);
It can be done if I map into a vec!, but I would rather not make additional conversions/allocations, since I use the result for further calculations (the dot products in the code above).
If I split the calculations in the closure into two parts and use, for example, mapv_inplace, I will have to compute a lot of the same things twice, which I want to avoid because they are very expensive.
How can such an issue be resolved efficiently?
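One common way to handle this (a sketch; the function and variable names are illustrative) is to preallocate both outputs and fill them in a single pass with ndarray's Zip, so the shared computation runs once per element and no intermediate Vec is allocated:

use ndarray::{Array1, Zip};

// Fill two preallocated arrays in one pass over x.
fn compute_two(x: &Array1<f64>) -> (Array1<f64>, Array1<f64>) {
    let mut a = Array1::<f64>::zeros(x.len());
    let mut b = Array1::<f64>::zeros(x.len());
    Zip::from(&mut a)
        .and(&mut b)
        .and(x)
        .for_each(|a_i, b_i, &v| {
            // the expensive shared computation would happen once here
            *a_i = v;       // dummy result1
            *b_i = v * 2.0; // dummy result2
        });
    (a, b)
}

The two outputs then feed straight into a.dot(&y) and b.dot(&y) with no extra copies.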

Querying data in Delta lake using Rust delta-rs

How do I query data in a Delta Lake table using Rust with delta-rs? In my case, the data is in multiple Parquet files.
Thank you.
Could you please share a small working example?
You will need either Polars or DataFusion to do so.
Here is a naive approach in Rust:
use deltalake::delta::open_table;
use polars::prelude::*;

#[tokio::main]
async fn main() {
    let lf = read_delta_table("delta_test_5m").await;
    println!("{:?}", lf.select([count()]).collect());
}

async fn read_delta_table(path: &str) -> LazyFrame {
    let dt = open_table(path).await.unwrap();
    let files = dt.get_files();
    let mut df_collection: Vec<DataFrame> = vec![];
    for file_path in files.into_iter() {
        let full_path = format!("{}/{}", path, file_path.as_ref());
        let mut file = std::fs::File::open(full_path).unwrap();
        let df = ParquetReader::new(&mut file).finish().unwrap();
        df_collection.push(df);
    }
    let empty_head = df_collection[0].clone().lazy().limit(0);
    df_collection
        .into_iter()
        .fold(empty_head, |acc, df| concat([acc, df.lazy()], false, false).unwrap())
}
This code first gets the list of Parquet files belonging to the most recent version of the Delta table.
Then one DataFrame is created per file.
Finally, these DataFrames are concatenated into a single DataFrame.
Note that Polars offers this feature out of the box in Python:
import polars as pl
print(pl.read_delta("path_to_delta"))
I did not find a way to read Delta tables directly through Polars in Rust, but I would expect it to be added soon.
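As a possible refinement (a sketch only; it assumes polars is built with its lazy Parquet-scanning features), the files could be scanned lazily instead of read eagerly, deferring all I/O until collect():

use polars::prelude::*;

// Sketch: build a LazyFrame per Parquet file path and concatenate them lazily.
fn scan_parquet_files(paths: Vec<String>) -> PolarsResult<LazyFrame> {
    let frames = paths
        .into_iter()
        .map(|p| LazyFrame::scan_parquet(p, ScanArgsParquet::default()))
        .collect::<PolarsResult<Vec<_>>>()?;
    concat(frames, false, false)
}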

Efficiently build a Polars DataFrame row by row in Rust

I would like to create a large Polars DataFrame using Rust, building it up row by row using data scraped from web pages. What is an efficient way to do this?
It looks like the DataFrame should be created from a Vec of Series rather than adding rows to an empty DataFrame. However, how should a Series be built up efficiently? I could create a Vec and then create a Series from the Vec, but that sounds like it will end up copying all elements. Is there a way to build up a Series element-by-element, and then build a DataFrame from those?
I will actually be building up several DataFrames in parallel using Rayon, then combining them, but it looks like vstack does what I want there. It's the creation of the individual DataFrames that I can't find out how to do efficiently.
I did look at the source of the CSV parser, but it is very complicated and probably highly optimised. Is there a simple approach that is still reasonably efficient?
ChunkedArray::from_vec does what you want:
pub fn from_vec(
    name: &str,
    v: Vec<<T as PolarsNumericType>::Native>
) -> ChunkedArray<T>
Create a new ChunkedArray by taking ownership of the Vec. This operation is zero copy.
(See the ChunkedArray::from_vec entry in the polars API docs.) You can then call into_series on it.
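A minimal sketch of that approach (column names and types are illustrative): accumulate plain Vecs row by row, then move each Vec into a ChunkedArray via from_vec and assemble the DataFrame once at the end.

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // accumulate rows into plain Vecs (stand-in for scraped data)
    let mut prices: Vec<f64> = Vec::new();
    let mut counts: Vec<u32> = Vec::new();
    for i in 0..5u32 {
        prices.push(f64::from(i) * 1.5);
        counts.push(i);
    }
    // zero copy: the Vecs are moved into the ChunkedArrays
    let price = Float64Chunked::from_vec("price", prices).into_series();
    let count = UInt32Chunked::from_vec("count", counts).into_series();
    let df = DataFrame::new(vec![price, count])?;
    println!("{:?}", df);
    Ok(())
}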
The simplest, if perhaps not the most performant, answer is to just maintain a map of vectors and turn them into the series that get fed to a DataFrame all at once.
use std::collections::BTreeMap;

let mut columns = BTreeMap::new();
for datum in get_data_from_web() {
    // For simplicity suppose datum is itself a BTreeMap
    // (More likely it's a serde_json::Value)
    // It's assumed that every datum has the same keys; if not, the
    // Vecs won't have the same length
    // It's also assumed that the values of datum are all of the same known type
    for (k, v) in datum {
        columns.entry(k).or_insert_with(Vec::new).push(v);
    }
}
let df = DataFrame::new(
    columns
        .into_iter()
        .map(|(name, values)| Series::new(&name, values))
        .collect::<Vec<_>>()
).unwrap();
