How can I efficiently process many slices of a dataframe in Polars (Rust)?

I have a dataset of time series data similar to the following:
let series_one = Series::new(
    "a",
    (0..4).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_two = Series::new(
    "b",
    (4..8).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_three = Series::new(
    "c",
    (8..12).into_iter().map(|v| v as f64).collect::<Vec<_>>(),
);
let series_dates = Series::new(
    "date",
    (0..4)
        .into_iter()
        .map(|v| NaiveDate::default() + Duration::days(2 * v))
        .collect::<Vec<_>>(),
);
let df = DataFrame::new(vec![series_one, series_two, series_three, series_dates]).unwrap();
The resulting dataframe has the following shape:
shape: (4, 4)
┌─────┬─────┬──────┬────────────┐
│ a   ┆ b   ┆ c    ┆ date       │
│ --- ┆ --- ┆ ---  ┆ ---        │
│ f64 ┆ f64 ┆ f64  ┆ date       │
╞═════╪═════╪══════╪════════════╡
│ 0.0 ┆ 4.0 ┆ 8.0  ┆ 1970-01-01 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ 5.0 ┆ 9.0  ┆ 1970-01-02 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2.0 ┆ 6.0 ┆ 10.0 ┆ 1970-01-03 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 7.0 ┆ 11.0 ┆ 1970-01-04 │
└─────┴─────┴──────┴────────────┘
For every row in the dataframe, I would like to apply a function to the slice that contains all rows up to and including that row.
If I have some function some_fn:
fn some_fn(_df: DataFrame) -> DataFrame {
    // Do some operation with the dataframe slice that doesn't need to mutate
    // any data and returns a new dataframe with some results
    DataFrame::new(vec![
        Series::new("a_result", vec![1.0, 2.0, 3.0, 4.0]),
        Series::new("b_result", vec![5.0, 6.0, 7.0, 8.0]),
        Series::new("c_result", vec![9.0, 10.0, 11.0, 12.0]),
    ])
    .unwrap()
}
and I attempt to do the following:
let size = df.column("a").unwrap().len();
let results = (0..size)
    .into_iter()
    .map(|i| {
        let t = df.head((i + 1).into());
        some_fn(t)
    })
    .reduce(|acc, b| acc.vstack(&b).unwrap())
    .unwrap();
I find that it is exceedingly slow, taking about 1ms to process just 3000 rows this way (this is just benchmarking an empty function, so the time here is not due to some heavy computation, just the slicing time). What is the right way to take full advantage of polars and do this processing efficiently?
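Since every prefix slice is processed independently of the others, one variation worth sketching is to fan the per-slice calls out across a thread pool and stack the partial results afterwards. This is only an illustrative sketch, assuming the rayon crate is available and that some_fn is side-effect free; expanding_apply is a hypothetical helper name, not part of the polars API:
use polars::prelude::*;
use rayon::prelude::*;

// Hypothetical helper: runs `some_fn` on every expanding prefix of `df` in
// parallel and stacks the per-slice results into one dataframe.
fn expanding_apply(df: &DataFrame, some_fn: impl Fn(DataFrame) -> DataFrame + Sync) -> DataFrame {
    (0..df.height())
        .into_par_iter()
        .map(|i| some_fn(df.head(Some(i + 1))))
        // vstack is associative and rayon combines adjacent chunks in order,
        // so the stacked rows keep the original row ordering
        .reduce_with(|acc, b| acc.vstack(&b).unwrap())
        .expect("dataframe has at least one row")
}
Calling expanding_apply(&df, some_fn) would replace the sequential map/reduce above; whether it is actually faster depends on how much work some_fn does per slice.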

Related

How do you vertically concatenate two Polars data frames in Rust?

According to the Polars documentation, in Python you can vertically concatenate two data frames using the procedure shown in the code snippet below:
df_v1 = pl.DataFrame(
    {
        "a": [1],
        "b": [3],
    }
)
df_v2 = pl.DataFrame(
    {
        "a": [2],
        "b": [4],
    }
)
df_vertical_concat = pl.concat(
    [
        df_v1,
        df_v2,
    ],
    how="vertical",
)
print(df_vertical_concat)
The output of the above code is:
shape: (2, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 4   │
└─────┴─────┘
How do you perform the same operation in Rust?
You can do it as follows:
let df_vertical_concat = df_v1.vstack(&df_v2).unwrap();
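For a fully self-contained version of that call, here is a minimal sketch assuming the df! macro from the polars prelude is available to build the two frames:
use polars::prelude::*;

fn main() {
    // Build the two single-row frames with the df! macro, then stack the
    // second frame under the first; vstack returns a new DataFrame.
    let df_v1 = df!["a" => [1], "b" => [3]].unwrap();
    let df_v2 = df!["a" => [2], "b" => [4]].unwrap();
    let df_vertical_concat = df_v1.vstack(&df_v2).unwrap();
    println!("{:?}", df_vertical_concat);
}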

How to parse date string with days and months without 0 padding in rust version of polars?

I am reading a CSV file with dates in month/day/year format (e.g. "11/15/2022"), but the month and day are not zero-padded. The following is my test code:
use polars::prelude::*;
use polars_lazy::prelude::*;

fn main() {
    let df = df![
        "x" => ["1/4/2011", "2/4/2011", "3/4/2011", "4/4/2011"],
        "y" => [1, 2, 3, 4],
    ].unwrap();
    let lf: LazyFrame = df.lazy();
    let options = StrpTimeOptions {
        fmt: Some("%m/%d/%Y".into()),
        date_dtype: DataType::Date,
        ..Default::default()
    };
    let res = lf.clone()
        .with_column(col("x").str().strptime(options).alias("new time"))
        .collect().unwrap();
    println!("{:?}", res);
}
The output is
shape: (4, 3)
┌──────────┬─────┬──────────┐
│ x        ┆ y   ┆ new time │
│ ---      ┆ --- ┆ ---      │
│ str      ┆ i32 ┆ date     │
╞══════════╪═════╪══════════╡
│ 1/4/2011 ┆ 1   ┆ null     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2/4/2011 ┆ 2   ┆ null     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3/4/2011 ┆ 3   ┆ null     │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 4/4/2011 ┆ 4   ┆ null     │
└──────────┴─────┴──────────┘
In the options I tried "%-m/%-d/%Y" instead of "%m/%d/%Y", as mentioned in the documentation, but it panicked at runtime:
thread '<unnamed>' panicked at 'attempt to subtract with overflow', /home/xxx/.cargo/registry/src/github.com-1ecc6299db9ec823/polars-time-0.21.1/src/chunkedarray/utf8/mod.rs:234:33
What is the correct way to read this format? I am using "Ubuntu 20.04.4 LTS".
Your Default is making it run with the wrong flags. You need to set exact to true:
...
let options = StrpTimeOptions {
    fmt: Some("%-m/%-d/%Y".into()),
    date_dtype: DataType::Date,
    exact: true,
    ..Default::default()
};
...
Full tested code, with a zero-padded value included:
use polars::prelude::*;
use polars_lazy::dsl::StrpTimeOptions;
use polars_lazy::prelude::{col, IntoLazy, LazyFrame};

fn main() {
    let df = df![
        "x" => ["01/04/2011", "2/4/2011", "3/4/2011", "4/4/2011"],
        "y" => [1, 2, 3, 4],
    ]
    .unwrap();
    let lf: LazyFrame = df.lazy();
    let options = StrpTimeOptions {
        fmt: Some("%-m/%-d/%Y".into()),
        date_dtype: DataType::Date,
        exact: true,
        ..Default::default()
    };
    let res = lf
        .clone()
        .with_column(col("x").str().strptime(options).alias("new time"))
        .collect()
        .unwrap();
    println!("{:?}", res);
}
Outputs:
shape: (4, 3)
┌────────────┬─────┬────────────┐
│ x          ┆ y   ┆ new time   │
│ ---        ┆ --- ┆ ---        │
│ str        ┆ i32 ┆ date       │
╞════════════╪═════╪════════════╡
│ 01/04/2011 ┆ 1   ┆ 2011-01-04 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2/4/2011   ┆ 2   ┆ 2011-02-04 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3/4/2011   ┆ 3   ┆ 2011-03-04 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4/4/2011   ┆ 4   ┆ 2011-04-04 │
└────────────┴─────┴────────────┘

Groupby year-month and drop columns with all NaNs in Python

Based on the output dataframe from this link:
import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
models = df.columns[df.columns.str.endswith('_values')]

# function to calculate mape
def mape(y_true, y_pred):
    y_pred = np.array(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.clip(np.abs(y_true), 1, np.inf),
                   axis=0) * 100

errors = (df.groupby(pd.Grouper(freq='M'))
            .apply(lambda x: mape(x[models], x[['target']])))

k = 2
n = len(models)
sorted_args = np.argsort(errors, axis=1) < k
res = pd.merge_asof(df[['target']], sorted_args,
                    left_index=True,
                    right_index=True,
                    direction='forward')
topk = df[models].where(res[models])
df = df.join(topk.add_suffix('_mape'))
df = df[['target', 'A_values_mape', 'B_values_mape', 'C_values_mape', 'D_values_mape',
         'E_values_mape']]
df
Out:
target A_values_mape ... D_values_mape E_values_mape
2013-02-26 1.281624 6.059783 ... 3.126731 NaN
2013-02-27 0.585713 1.789931 ... 7.843101 NaN
2013-02-28 9.638430 9.623960 ... 5.612724 NaN
2013-03-01 1.950960 NaN ... NaN 5.693051
2013-03-02 0.690563 NaN ... NaN 7.322250
... ... ... ... ...
2013-05-22 5.554824 NaN ... NaN 6.803052
2013-05-23 8.440801 NaN ... NaN 2.756443
2013-05-24 0.968086 NaN ... NaN 0.430184
2013-05-25 0.672555 NaN ... NaN 5.461017
2013-05-26 5.273122 NaN ... NaN 6.312104
How could I group by year-month and drop the columns with all NaNs, then rename the remaining columns to, e.g., top_1, top_2, ..., top_k?
The final expected result could be like this if k=2:
Pseudocode:
df2 = df.filter(regex='_mape$').groupby(pd.Grouper(freq='M')).dropna(axis=1, how='all')
df2.columns = ['top_1', 'top_2', ..., 'top_k']
df.join(df2)
As @Quang Hoang commented in the last post, we might be able to use justify_nd to achieve that, but I don't know how. Thanks for your help in advance.
EDIT:
dates = pd.date_range('20130226', periods=90)
df = pd.DataFrame(np.random.uniform(0, 10, size=(90, 6)), index=dates, columns=['A_values', 'B_values', 'C_values', 'D_values', 'E_values', 'target'])
models = df.columns[df.columns.str.endswith('_values')]
k = 2
n = len(models)

def grpProc(grp):
    err = mape(grp[models], grp[['target']])
    # sort_args = np.argsort(err) < k
    # cols = models[sort_args]
    cols = err.nsmallest(k).index
    out_cols = [f'top_{i+1}' for i in range(k)]
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv

wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)
res = df[['target']].join(wrk)
print(res)
Out:
target top_1 top_2
2013-02-26 1.281624 6.059783 9.972433
2013-02-27 0.585713 1.789931 0.968944
2013-02-28 9.638430 9.623960 6.165247
2013-03-01 1.950960 4.521452 5.693051
2013-03-02 0.690563 5.178144 7.322250
... ... ...
2013-05-22 5.554824 3.864723 6.803052
2013-05-23 8.440801 5.140268 2.756443
2013-05-24 0.968086 5.890717 0.430184
2013-05-25 0.672555 1.610210 5.461017
2013-05-26 5.273122 6.893207 6.312104
Actually, what you need to do for each group (by year / month) is:
- compute the errors locally for the current group,
- find the k "wanted" columns (calling argsort) and take the indicated columns from models,
- take the indicated columns from the current group and rename them to top_…,
- return what you have generated so far.
To do it, define a "group processing" function:
def grpProc(grp):
    err = mape(grp[models], grp[['target']])
    sort_args = np.argsort(err) < k
    cols = models[sort_args]
    out_cols = [f'top_{i+1}' for i in range(k)]
    rv = grp.loc[:, cols]
    rv.columns = out_cols
    return rv
Then, to generate top_… columns alone, apply this function to each group:
wrk = df.groupby(pd.Grouper(freq='M')).apply(grpProc)
And finally generate the expected result joining target column with wrk:
result = df[['target']].join(wrk)
First 15 rows of it, based on your source data, are:
target top_1 top_2
2013-02-26 1.281624 6.059783 3.126731
2013-02-27 0.585713 1.789931 7.843101
2013-02-28 9.638430 9.623960 5.612724
2013-03-01 1.950960 4.521452 5.693051
2013-03-02 0.690563 5.178144 7.322250
2013-03-03 6.177010 8.280144 6.174890
2013-03-04 1.263177 5.896541 4.422322
2013-03-05 5.888856 9.159396 8.906554
2013-03-06 2.013227 8.237912 3.075435
2013-03-07 8.482991 1.546148 6.476141
2013-03-08 7.986413 3.322442 4.738473
2013-03-09 5.944385 7.769769 0.631033
2013-03-10 7.543775 3.710198 6.787289
2013-03-11 5.816264 3.722964 6.795556
2013-03-12 3.054002 3.304891 8.258990
Edit
For the first group (2013-02-28) err contains:
A_values 48.759348
B_values 77.023855
C_values 325.376455
D_values 74.422508
E_values 60.602101
Note that the 2 lowest error values are 48.759348 and 60.602101, so from this group you should probably take A_values (this is OK) and E_values (instead of D_values).
So maybe the grpProc function, instead of:
sort_args = np.argsort(err) < k
cols = models[sort_args]
should contain:
cols = err.nsmallest(k).index

featuretools: manual derivation of the features generated by dfs?

Code example:
import featuretools as ft

es = ft.demo.load_mock_customer(return_entityset=True)

# Normalized one more time
es = es.normalize_entity(
    new_entity_id="device",
    base_entity_id="sessions",
    index="device",
)

feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_entity="customers",
    agg_primitives=["std"],
    groupby_trans_primitives=['cum_count'],
    max_depth=2
)
I'd like to look deeper into the STD(sessions.CUM_COUNT(device) by customer_id) feature, so I tried to generate it manually, but I got a different result:
df = ft.demo.load_mock_customer(return_single_table=True)
a = df.groupby("customer_id")['device'].cumcount()
a.name = "cumcount_device"
a = pd.concat([df, a], axis=1)
b = a.groupby("customer_id")['cumcount_device'].std()
>>> b
customer_id
1 36.517
2 26.991
3 26.991
4 31.610
5 22.949
Name: cumcount_device, dtype: float64
What am I missing?
Thanks for the question. The calculation needs to be based on the data frame from sessions.
df = es['sessions'].df
cumcount = df['device'].groupby(df['customer_id']).cumcount()
std = cumcount.groupby(df['customer_id']).std()
std.round(3).loc[feature_matrix.index]
customer_id
5 1.871
4 2.449
1 2.449
3 1.871
2 2.160
dtype: float64
You should get the same output as in DFS.

Elegant way to find indexes for lists within lists?

Pretty new to Python. I'm trying to index items in CSV files by row/column. The only method I've found is implementing a for loop to search each row in the list.
readCSV = [['', 'A', 'B', 'C', 'D'],
[1.0, 3.1, 5.0, 1.7, 8.2],
[2.0, 6.2, 7.0, 2.2, 9.3],
[3.0, 8.8, 5.5, 4.4, 6.0]]
row_column = []
for row in readCSV:
    if my_item in row:
        row_column.append(row[0])
        row_column.append(readCSV[0][row.index(my_item)])
So for my_item = 6.2, I get row_column = [2.0, 'A'].
This works fine, but I can't help thinking there's a more elegant solution.
Try this one:
result = [(i, j) for i, k in enumerate(readCSV) for j, n in enumerate(k) if my_item == n]
Alternatively, you can load the rows into a pandas DataFrame and look the value up there:
import pandas as pd
import numpy as np
df = pd.DataFrame(readCSV[1:],columns=readCSV[0])
#### Output ####
No A B C D
0 1.0 3.1 5.0 1.7 8.2
1 2.0 6.2 7.0 2.2 9.3
2 3.0 8.8 5.5 4.4 6.0
## This provides the row in which there is a hit.
df1 = df[(df.A == my_item) | (df.B == my_item) |(df.C == my_item) | (df.D == my_item)]
print(df1)
#### Output ####
No A B C D
1 2.0 6.2 7.0 2.2 9.3
## If you want only the column values that are a hit for your my_item:
z1 = pd.concat([df[df['A'] == my_item][['No','A']],df[df['B'] == my_item][['No','B']],df[df['C'] == my_item][['No','C']],df[df['D'] == my_item][['No','D']]])
print(z1)
#### Output ####
A B C D No
1 6.2 NaN NaN NaN 2.0
## In case you want to drop the NaN, you can use np.isnan
z1 = np.array(z1)
print(z1[:,~np.any(np.isnan(z1), axis=0)])
#### Output ####
[[6.2 2. ]]
