I'm working on a little utility app. Coming from python/pandas, I'm trying to rebuild some basic tools that can be distributed as executables. I'm having a hard time interpreting the documentation for what seems like a fairly simple process: read some raw data, resample it based on the datetime column, and then interpolate to fill in missing data as necessary.
My Cargo.toml looks like:
[dependencies]
polars = "0.19.0"
And the code I've written so far is:
use polars::prelude::*;
use std::fs::File;
fn main() {
let mut df = CsvReader::new("raw.csv".into())
.finish();
//interpolate to clean up blank/nan
//resample/groupby 15Min-1D using mean, blank/nan if missing
let mut file = File::create("final.csv").expect("File not written!!!");
CsvWriter::new(&mut file)
.has_header(true)
.with_delimiter(b',')
.finish(&df);
}
and the raw.csv data might look like:
site,datetime,val1,val2,val3,val4,val5,val6
XX1,2021-01-01 00:45,,,,4.60,,
XX1,2021-01-01 00:50,,,,2.30,,
XX1,2021-01-01 00:53,21.90,16.00,77.67,3.45,1027.20,0.00
XX1,2021-01-01 01:20,,,,4.60,,
XX1,2021-01-01 01:53,21.90,16.00,77.67,3.45,1026.90,0.00
XX1,2021-01-01 01:55,,,,0.00,,
XX1,2021-01-01 02:00,,,,0.00,,
XX1,2021-01-01 02:45,,,,5.75,,
XX1,2021-01-01 02:50,,,,8.05,,
XX1,2021-01-01 02:53,21.00,16.00,80.69,8.05,1026.80,0.00
But I can't seem to call the methods because I get errors like:
method not found in `Result<DataFrame, PolarsError>`
or
expected struct `DataFrame`, found enum `Result`
and I'm not sure how to properly convert between these types.
I've tried obviously wrong answers like:
let grouped = df.lazy().groupby_dynamic("datetime", "1h").agg("datetime", mean());
but basically, I'm looking for the polars equivalent of pandas code:
df = df.interpolate()
df = df.resample(sample_frequency).mean()
Any help would be appreciated!
Here is an example of how you might:
upsample via a left join on a date range,
fill missing values with interpolate,
then downsample via a groupby_dynamic.
use chrono::prelude::*;
use polars::prelude::*;
use polars_core::time::*;
use std::io::Cursor;
use polars::frame::groupby::DynamicGroupOptions;
fn main() -> Result<()> {
let csv = "site,datetime,val1,val2,val3,val4,val5,val6
XX1,2021-01-01 00:45,,,,4.60,,
XX1,2021-01-01 00:50,,,,2.30,,
XX1,2021-01-01 00:53,21.90,16.00,77.67,3.45,1027.20,0.00
XX1,2021-01-01 01:20,,,,4.60,,
XX1,2021-01-01 01:53,21.90,16.00,77.67,3.45,1026.90,0.00
XX1,2021-01-01 01:55,,,,0.00,,
XX1,2021-01-01 02:00,,,,0.00,,
XX1,2021-01-01 02:45,,,,5.75,,
XX1,2021-01-01 02:50,,,,8.05,,
XX1,2021-01-01 02:53,21.00,16.00,80.69,8.05,1026.80,0.00
";
let cursor = Cursor::new(csv);
// prefer scan csv when your data is not in memory
let mut df = CsvReader::new(cursor).finish()?;
df.try_apply("datetime", |s| {
s.utf8()?
.as_datetime(Some("%Y-%m-%d %H:%M"), TimeUnit::Nanoseconds)
.map(|ca| ca.into_series())
})?;
// now we take the datetime column and extract timestamps from them
// with these timestamps we create a `date_range` with an interval of 1 minute
let dt = df.column("datetime")?;
let timestamp = dt.cast(&DataType::Int64)?;
let timestamp_ca = timestamp.i64()?;
let first = timestamp_ca.get(0).unwrap();
let last = timestamp_ca.get(timestamp_ca.len() - 1).unwrap();
let range = date_range(
first,
last,
Duration::parse("1m"),
ClosedWindow::Both,
"date_range",
TimeUnit::Nanoseconds,
);
let range_df = DataFrame::new(vec![range.into_series()])?;
// now that we got the date_range we use it to upsample the dataframe.
// after that we interpolate the missing values
// and then we groupby in a fixed time interval to get more regular output
let out = range_df
.lazy()
.join(
df.lazy(),
[col("date_range")],
[col("datetime")],
JoinType::Left,
)
.select([col("*").interpolate()])
.groupby_dynamic([], DynamicGroupOptions {
index_column: "date_range".into(),
every: Duration::parse("15m"),
period: Duration::parse("15m"),
offset: Duration::parse("0m"),
truncate: true,
include_boundaries: false,
closed_window: ClosedWindow::Left,
}).agg([col("*").first()])
.collect()?;
dbg!(out);
Ok(())
}
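The example stops at dbg!(out). If you want to end up with a final.csv the way your original snippet does, a small helper in the same style should do it. This is only a sketch reusing the CsvWriter calls from your question, not part of the answer's code:

use polars::prelude::*;
use std::fs::File;

// Sketch: write a DataFrame to CSV the same way the question's snippet does.
fn write_csv(df: &DataFrame, path: &str) -> Result<()> {
    let mut file = File::create(path)?;
    CsvWriter::new(&mut file)
        .has_header(true)
        .with_delimiter(b',')
        .finish(df)?;
    Ok(())
}

Calling write_csv(&out, "final.csv")? in place of the dbg!(out) line would write the resampled frame to disk.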
These are the features I used:
["csv-file", "pretty_fmt", "temporal", "dtype-date", "dtype-datetime", "lazy", "interpolate", "dynamic_groupby"]
Output
This outputs:
[src/main.rs:68] out = shape: (9, 9)
┌─────────────────────┬─────────────────────┬────────────┬────────────┬─────┬────────────┬────────────┬─────────────┬────────────┐
│ date_range ┆ date_range_first ┆ site_first ┆ val1_first ┆ ... ┆ val3_first ┆ val4_first ┆ val5_first ┆ val6_first │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ datetime[ns] ┆ str ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════════════════╪═════════════════════╪════════════╪════════════╪═════╪════════════╪════════════╪═════════════╪════════════╡
│ 2021-01-01 00:45:00 ┆ 2021-01-01 00:45:00 ┆ XX1 ┆ null ┆ ... ┆ null ┆ 4.6 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 01:00:00 ┆ 2021-01-01 01:00:00 ┆ null ┆ 21.9 ┆ ... ┆ 77.67 ┆ 4.6 ┆ 1027.165 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 01:15:00 ┆ 2021-01-01 01:15:00 ┆ null ┆ 21.9 ┆ ... ┆ 77.67 ┆ 4.6 ┆ 1027.09 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 01:30:00 ┆ 2021-01-01 01:30:00 ┆ null ┆ 21.9 ┆ ... ┆ 77.67 ┆ 4.251515 ┆ 1027.015 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:00:00 ┆ 2021-01-01 02:00:00 ┆ XX1 ┆ 21.795 ┆ ... ┆ 78.022333 ┆ 0.0 ┆ 1027.153333 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:15:00 ┆ 2021-01-01 02:15:00 ┆ null ┆ 21.57 ┆ ... ┆ 78.777333 ┆ 4.983333 ┆ 1027.053333 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:30:00 ┆ 2021-01-01 02:30:00 ┆ null ┆ 21.345 ┆ ... ┆ 79.532333 ┆ 5.366667 ┆ 1026.953333 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:45:00 ┆ 2021-01-01 02:45:00 ┆ XX1 ┆ 21.12 ┆ ... ┆ 80.287333 ┆ 5.75 ┆ 1026.853333 ┆ 0.0 │
└─────────────────────┴─────────────────────┴────────────┴────────────┴─────┴────────────┴────────────┴─────────────┴────────────┘
Note that polars_core is also needed for the time module; this will be re-exported from polars in the next patch.
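For reference, a Cargo.toml along these lines should pull everything in. Treat the version numbers and the polars-core feature as assumptions and adjust them to the polars release you are actually on:

[dependencies]
polars = { version = "0.19", features = ["csv-file", "pretty_fmt", "temporal", "dtype-date", "dtype-datetime", "lazy", "interpolate", "dynamic_groupby"] }
# only needed for the time module (date_range) until it is re-exported from polars
polars-core = { version = "0.19", features = ["temporal"] }
chrono = "0.4"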
Try something like this:
use polars::prelude::*;
use std::fs::File;
fn main() {
let df = CsvReader::from_path("raw.csv")
.unwrap()
.finish()
.unwrap();
//interpolate to clean up blank/nan
//resample/groupby 15Min-1D using mean, blank/nan if missing
let mut file = File::create("final.csv").expect("File not written!!!");
CsvWriter::new(&mut file)
.has_header(true)
.with_delimiter(b',')
.finish(&df)
.unwrap();
}
In general, if you get an error about `Result`, the value you have is a `Result<DataFrame, PolarsError>`: call .unwrap() or .expect(...) on it to get at the DataFrame, or propagate the error with ?, as shown below.
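Here is a minimal sketch of the same program using ? instead of unwrap, assuming CsvReader::from_path is available in your polars version; returning a Result from main lets ? propagate any PolarsError:

use polars::prelude::*;
use std::fs::File;

fn main() -> Result<()> {
    // `?` unwraps the Ok value or returns the error from main instead of panicking
    let df = CsvReader::from_path("raw.csv")?.finish()?;

    let mut file = File::create("final.csv")?;
    CsvWriter::new(&mut file)
        .has_header(true)
        .with_delimiter(b',')
        .finish(&df)?;
    Ok(())
}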
Related
How can I read a JSON file with polars, with the following format:
{<json object>},
{<json object>}
I can read the same file in DataFusion as follows:
use datafusion::prelude::*;
use log::info;
use std::sync::Arc;
use std::time::Instant;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
env_logger::init();
let start = Instant::now();
let file_path = "datalayers/landing/Toys_and_Games_5.json";
// let file_path = "datalayers/landing/test_file.json";
let df = read_data(file_path.to_string()).await?;
let duration = start.elapsed();
info!{"Pipeline executed successfully!"}
info!("Pipeline Execution time: {:?}", duration);
Ok(())
}
async fn read_data(path: String) -> datafusion::error::Result<Arc<DataFrame>> {
let mut ctx = SessionContext::new();
let selected_columns = vec![
"asin",
"vote",
"verified",
"unixReviewTime",
"reviewTime",
"reviewText",
];
let df_ = ctx.read_json(path, NdJsonReadOptions::default()).await?;
let df_ = df_.select_columns(&selected_columns)?;
info!("Data loading plan created successfully!");
Ok(df_)
}
The spark code is quite similar. The only reference I found for polars is in old API documentation, using JsonReader and a Cursor, but it does not show how to read data from a file. The file in the example can be downloaded with wget as follows:
wget -P datalayers/landing http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Toys_and_Games_5.json.gz
Using the latest polars as of 2022, you have to make sure that "json" is added to the polars features under [dependencies] in Cargo.toml. For example,
[dependencies]
polars = { version="0.24.2", features = ["lazy", "json"] }
tokio = { version = "1.21.1", features = ["full"] }
Now for reading line-separated JSON (newline-delimited JSON, a.k.a. NDJSON or JSON Lines) into a dataframe:
use polars::prelude::*;
fn main() -> PolarsResult<()> {
let schema = Schema::from(vec![
Field::new("reviwerID", DataType::Utf8),
Field::new("asin", DataType::Utf8),
Field::new("reviewerName", DataType::Utf8),
Field::new("helpful", DataType::List(Box::new(DataType::Int32))),
Field::new("reviewText", DataType::Utf8),
Field::new("overall", DataType::Float64),
Field::new("summary", DataType::Utf8),
Field::new("unixReviewTime", DataType::Int64),
Field::new("reviewTime", DataType::Utf8),
Field::new("style", DataType::Utf8),
]);
let df = match LazyJsonLineReader::new("Toys_and_Games_5.ndjson".into())
.with_schema(schema)
.finish() {
Ok(lf) => lf,
Err(e) => panic!("Error: {}", e),
}
.collect();
println!("{:?}", df);
Ok(())
}
The output is:
Ok(shape: (3695, 10)
┌───────────┬────────────┬──────────────────────────┬─────────┬─────┬──────────────────────────────────┬────────────────┬─────────────┬───────┐
│ reviwerID ┆ asin ┆ reviewerName ┆ helpful ┆ ... ┆ summary ┆ unixReviewTime ┆ reviewTime ┆ style │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i32 ┆ ┆ str ┆ i64 ┆ str ┆ str │
╞═══════════╪════════════╪══════════════════════════╪═════════╪═════╪══════════════════════════════════╪════════════════╪═════════════╪═══════╡
│ null ┆ 0486427706 ┆ Ginger ┆ null ┆ ... ┆ Nice book ┆ 1381017600 ┆ 10 6, 2013 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ 0486427706 ┆ Dragonflies & Autumn ┆ null ┆ ... ┆ Great pictures ┆ 1376006400 ┆ 08 9, 2013 ┆ null │
│ ┆ ┆ Leaves ┆ ┆ ┆ ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ 0486427706 ┆ barbara ann ┆ null ┆ ... ┆ The pictures are great, I've ┆ 1459814400 ┆ 04 5, 2016 ┆ null │
│ ┆ ┆ ┆ ┆ ┆ don... ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ 0486427706 ┆ Samantha ┆ null ┆ ... ┆ So beautiful! ┆ 1455321600 ┆ 02 13, 2016 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ B00IEOH8KO ┆ A. Red ┆ null ┆ ... ┆ Made party decorating easy and ┆ 1453939200 ┆ 01 28, 2016 ┆ null │
│ ┆ ┆ ┆ ┆ ┆ a... ┆ ┆ ┆ │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ B00IEOH8KO ┆ nilda morales ┆ null ┆ ... ┆ Four Stars ┆ 1452988800 ┆ 01 17, 2016 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ B00IEOH8KO ┆ W. Ross ┆ null ┆ ... ┆ Five Stars ┆ 1449014400 ┆ 12 2, 2015 ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ null ┆ B00IEOH8KO ┆ Liz89 ┆ null ┆ ... ┆ Great Deal ┆ 1431129600 ┆ 05 9, 2015 ┆ null │
└───────────┴────────────┴──────────────────────────┴─────────┴─────┴──────────────────────────────────┴────────────────┴─────────────┴───────┘)
IMPORTANT NOTE
Notice that I had to provide the schema explicitly because the "style" column in your dataset is quite irregular, which means polars has difficulty inferring its type. Filling out the complete schema works around that, so I didn't bother modelling "style" as a Struct; you can go ahead and do that if you like :).
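As a usage note (not part of the answer above): since LazyJsonLineReader gives you a LazyFrame, you can also project just the columns you need before collecting, similar to the select_columns call in the DataFusion version. A sketch reusing the same schema; "vote" and "verified" are left out because they are not in that schema:

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    // same schema as in the answer above
    let schema = Schema::from(vec![
        Field::new("reviwerID", DataType::Utf8),
        Field::new("asin", DataType::Utf8),
        Field::new("reviewerName", DataType::Utf8),
        Field::new("helpful", DataType::List(Box::new(DataType::Int32))),
        Field::new("reviewText", DataType::Utf8),
        Field::new("overall", DataType::Float64),
        Field::new("summary", DataType::Utf8),
        Field::new("unixReviewTime", DataType::Int64),
        Field::new("reviewTime", DataType::Utf8),
        Field::new("style", DataType::Utf8),
    ]);

    // select a subset of columns on the LazyFrame before collecting,
    // so the projection can be pushed into the NDJSON scan
    let df = LazyJsonLineReader::new("Toys_and_Games_5.ndjson".into())
        .with_schema(schema)
        .finish()?
        .select([
            col("asin"),
            col("reviewText"),
            col("unixReviewTime"),
            col("reviewTime"),
        ])
        .collect()?;

    println!("{}", df);
    Ok(())
}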
I would like to only include unique values in my polars Dataframe, based on one column.
In the example below I would like to create a new dataframe with only uniques based on the "col_float" column.
Before:
┬───────────┬──────────┬────────────┬────────────┐
┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
┆ --- ┆ --- ┆ --- ┆ --- │
┆ f64 ┆ bool ┆ str ┆ date │
╪═══════════╪══════════╪════════════╪════════════╡
┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
┆ 20.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
┴───────────┴──────────┴────────────┴────────────┘
after:
┬───────────┬──────────┬────────────┬────────────┐
┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
┆ --- ┆ --- ┆ --- ┆ --- │
┆ f64 ┆ bool ┆ str ┆ date │
╪═══════════╪══════════╪════════════╪════════════╡
┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
┴───────────┴──────────┴────────────┴────────────┘
(Notice the third row getting dropped because col_float was not unique)
Intuitively, one of my attempts was:
let mut df = pl.DataFrame(
{
"col_float": [10.0, 20.0, 20.0, 40.0],
"col_bool": [True, False, True, False],
"col_str": pl.repeat("2020-01-01", 4, eager=True),
};
let mut df2=DataFrame::new(vec![&df[0]]).unwrap();
df= df.unique(df2,UniqueKeepStrategy::First);
but got:
expected `Option<&[String]>`, found `DataFrame`
Which was to be expected beforehand of course.
I'm not sure whether I'm using the right function and, if I am, how this subset should be passed. Searching the documentation and GitHub did not help, since the examples and code only ever pass "None" as the subset.
This turned out to be less of a polars question and more one about my experience with Rust.
Working example (written out as compilable Rust, using the df! macro for the DataFrame construction):
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = df![
        "col_float" => [10.0, 20.0, 20.0, 40.0],
        "col_bool" => [true, false, true, false],
        "col_str" => ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"]
    ]?;
    let df = df.unique(Some(&["col_float".to_string()]), UniqueKeepStrategy::First)?;
    println!("{}", df);
    Ok(())
}
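If you are in the lazy API anyway, LazyFrame has a similar method. This is only a sketch; the exact argument types of LazyFrame::unique have shifted a bit between polars releases, so check the docs for the version you are on:

use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let df = df![
        "col_float" => [10.0, 20.0, 20.0, 40.0],
        "col_bool" => [true, false, true, false],
        "col_str" => ["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"]
    ]?;

    // keep the first row for each distinct value of "col_float"
    let deduped = df
        .lazy()
        .unique(Some(vec!["col_float".to_string()]), UniqueKeepStrategy::First)
        .collect()?;

    println!("{}", deduped);
    Ok(())
}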
I want to filter all duplicated rows from a polars dataframe. What I've tried:
df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
df
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
df.filter(pl.all().is_duplicated())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
This also selects the first row: the expression appears to work column-by-column and returns every row where each column's value has a corresponding duplicate somewhere in that column, which is not the intended outcome.
Boolean indexing works:
df[df.is_duplicated(), :]
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
But it leaves me wondering
if this is indeed the only way to do it,
if there's a way to use .filter() and expressions to achieve the same result
if this is the most efficient way to achieve the desired result
In general, the is_duplicated method will likely perform best. Let's take a look at some alternative ways to accomplish this. And we'll do some (very) non-rigorous benchmarking - just to see which ones perform reasonably well.
Some alternatives
One alternative is a filter statement with an over (windowing) expression on all columns. One caution with windowed expressions - they are convenient, but can be costly performance-wise.
df.filter(pl.count("column_1").over(df.columns) > 1)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Another alternative is a groupby, followed by a join. Basically, we'll count the number of times that combinations of columns occur. I'm using a semi join here, simply because I don't want to include the count column in my final results.
df.join(
df=df.groupby(df.columns)
.agg(pl.count().alias("count"))
.filter(pl.col("count") > 1),
on=df.columns,
how="semi",
)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Some (very) non-rigorous benchmarking
One way to see which alternatives perform reasonably well is to time the performance on a test dataset that might resemble the datasets that you will use. For lack of something better, I'll stick to something that looks close to the dataset in your question.
Set nbr_rows to something that will challenge your machine. (My machine is a 32-core system, so I'm going to choose a reasonably high number of rows.)
import numpy as np
import polars as pl
import string
nbr_rows = 100_000_000
df = pl.DataFrame(
{
"col1": np.random.choice(1_000, nbr_rows,),
"col2": np.random.choice(1_000, nbr_rows,),
"col3": np.random.choice(list(string.ascii_letters), nbr_rows,),
"col4": np.random.choice(1_000, nbr_rows,),
}
)
print(df)
shape: (100000000, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╡
│ 955 ┆ 186 ┆ j ┆ 851 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 530 ┆ 199 ┆ d ┆ 376 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 109 ┆ 609 ┆ G ┆ 115 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 886 ┆ 487 ┆ d ┆ 479 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 837 ┆ 406 ┆ Y ┆ 60 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 467 ┆ 769 ┆ P ┆ 344 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 548 ┆ 372 ┆ F ┆ 410 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 379 ┆ 578 ┆ t ┆ 287 │
└──────┴──────┴──────┴──────┘
Now let's benchmark some alternatives. Since these may or may not resemble your datasets (or your computing platform), I won't run the benchmarks multiple times. For our purposes, we're just trying to weed out alternatives that might perform very poorly.
Alternative: is_duplicated
import time
start = time.perf_counter()
df[df.is_duplicated(),:]
end = time.perf_counter()
print(end - start)
>>> print(end - start)
7.834882180000932
Since the is_duplicated method is provided by the Polars API, we can be reasonably assured that it will perform very well. Indeed, this should be the standard against which we compare other alternatives.
Alternative: filter using an over (windowing) expression
start = time.perf_counter()
df.filter(pl.count("col1").over(df.columns) > 1)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
18.136289041000055
As expected, the over (windowing) expression is rather costly.
Alternative: groupby followed by a join
start = time.perf_counter()
df.join(
df=df.groupby(df.columns)
.agg(pl.count().alias("count"))
.filter(pl.col("count") > 1),
on=df.columns,
how="semi",
)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
9.419006452999383
Somewhat better ... but not as good as using the is_duplicated method provided by the Polars API.
Alternative: concat_str
Let's also look at an alternative suggested in another answer. To be fair, @FBruzzesi did say "I am not sure this is optimal by any means". But let's look at how it performs.
start = time.perf_counter()
df.filter(pl.concat_str(df.columns, sep='|').is_duplicated())
end = time.perf_counter()
print(end - start)
>>> print(end - start)
37.238660977998734
Edit
Additional Alternative: filter and is_duplicated
We can also use filter with is_duplicated. Since df.is_duplicated() is not a column in the DataFrame when the filter is run, we'll need to wrap it in a polars.lit Expression.
start = time.perf_counter()
df.filter(pl.lit(df.is_duplicated()))
end = time.perf_counter()
print(end - start)
>>> print(end - start)
8.115436136999051
This performs just as well as using is_duplicated and boolean indexing.
Did this help? If nothing else, this shows some different ways to use the Polars API.
I think the optimal way really is:
df.filter(df.is_duplicated())
We have to be aware that Polars has is_duplicated() methods in both the expression API and the DataFrame API, but for the purpose of finding duplicated rows we need to evaluate every column and reach a consensus at the end on whether the whole row is duplicated or not.
df.is_duplicated() returns a boolean Series; it looks like this:
In []: df.is_duplicated()
Out[]:
shape: (4,)
Series: '' [bool]
[
false
true
false
true
]
In []:
Then, leveraging the DataFrame API, I think we can do this:
In []: df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
...: df
Out[]:
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
│ 1 ┆ 7 ┆ 9 │
│ 1 ┆ 2 ┆ 3 │
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
In []: df.filter(df.is_duplicated())
Out[]:
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
And then for subsetting:
In []: df.filter(df.select(["column_0", "column_1"]).is_duplicated())
Out[]:
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
│ 1 ┆ 7 ┆ 9 │
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Any other way seems to me like hacking the API; I wish we had a way to "fold" the boolean columns we get from expressions like:
df.filter(pl.col("*").is_duplicated())
df.filter(pl.all().is_duplicated())
It is funny, because look at this:
In []: df.select(pl.all().is_duplicated())
Out[]:
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞══════════╪══════════╪══════════╡
│ true ┆ true ┆ true │
│ true ┆ true ┆ true │
│ true ┆ false ┆ true │
│ true ┆ true ┆ true │
└──────────┴──────────┴──────────┘
This is what happens when we use is_duplicated() as expression.
I am not sure this is optimal by any means, but you could concatenate all rows and check for duplicates, namely:
import polars as pl
df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']], columns=["a", "b", "c"])
df.filter(pl.concat_str(["a", "b", "c"]).is_duplicated())
shape: (2, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└─────┴─────┴─────┘
I've started to practice Rust.
I ran the program and got:
VecStorage { data: [1.0, 88.0, 87.0, 1.0, 70.0, 77.0, 1.0, 80.0, 79.0, 1.0, 82.0, 85.0, 1.0, 90.0, 97.0, 1.0, 100.0, 98.0], nrows: Dynamic { value: 6 }, ncols: Dynamic { value: 3 } }
VecStorage { data: [1.0, 1.0, 1.0, 88.0, 80.0, 90.0, 87.0, 79.0, 97.0, 1.0, 1.0, 1.0, 70.0, 82.0, 100.0, 77.0, 85.0, 98.0], nrows: Dynamic { value: 3 }, ncols: Dynamic { value: 6 } }
main.rs:
use nalgebra::DMatrix;

let matrix = vec![1.0, 88.0, 87.0, 1.0, 70.0, 77.0, 1.0, 80.0, 79.0, 1.0, 82.0, 85.0, 1.0, 90.0, 97.0, 1.0, 100.0, 98.0];
let matrix = DMatrix::from_vec(6, 3, matrix);
println!("{:?}", matrix);
println!("{:?}", matrix.transpose());
However, the transposed matrix is different (from the correct transpose); any ideas why?
According to the documentation, the matrix is filled from the vector in column-major order. I suspect from the provided data that you were expecting row-major order (with the first column of the original matrix holding only 1s).
We can see this if we use Display formatting instead of Debug - the output in this case is the following:
┌ ┐
│ 1 1 1 │
│ 88 80 90 │
│ 87 79 97 │
│ 1 1 1 │
│ 70 82 100 │
│ 77 85 98 │
└ ┘
┌ ┐
│ 1 88 87 1 70 77 │
│ 1 80 79 1 82 85 │
│ 1 90 97 1 100 98 │
└ ┘
Playground
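If a row-major fill was intended, nalgebra also provides DMatrix::from_row_slice. A minimal sketch contrasting the two constructors (an illustration, not part of the answer above):

use nalgebra::DMatrix;

fn main() {
    let data = vec![
        1.0, 88.0, 87.0,
        1.0, 70.0, 77.0,
        1.0, 80.0, 79.0,
        1.0, 82.0, 85.0,
        1.0, 90.0, 97.0,
        1.0, 100.0, 98.0,
    ];

    // from_vec fills column-by-column; from_row_slice fills row-by-row,
    // which matches how `data` is laid out above
    let col_major = DMatrix::from_vec(6, 3, data.clone());
    let row_major = DMatrix::from_row_slice(6, 3, &data);

    println!("{}", col_major);
    println!("{}", row_major);
    println!("{}", row_major.transpose());
}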
Most numbering systems start with zero, go through the base-10 digits, and then go to letters once the base-10 digits have been exhausted:
Binary: 0,1
Octal: 0,1,2,3,4,5,6,7
Decimal: 0,1,2,3,4,5,6,7,8,9
Hexadecimal: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F
Even the ASCII order of characters has numbers come before letters.
The Base64 encoding scheme does things differently:
┌──────┬──────────┬┬──────┬──────────┬┬──────┬──────────┬┬──────┬──────────┐
│Value │ Encoding ││Value │ Encoding ││Value │ Encoding ││Value │ Encoding │
├──────┼──────────┼┼──────┼──────────┼┼──────┼──────────┼┼──────┼──────────┤
│ 0 │ A ││ 17 │ R ││ 34 │ i ││ 51 │ z │
│ 1 │ B ││ 18 │ S ││ 35 │ j ││ 52 │ 0 │
│ 2 │ C ││ 19 │ T ││ 36 │ k ││ 53 │ 1 │
│ 3 │ D ││ 20 │ U ││ 37 │ l ││ 54 │ 2 │
│ 4 │ E ││ 21 │ V ││ 38 │ m ││ 55 │ 3 │
│ 5 │ F ││ 22 │ W ││ 39 │ n ││ 56 │ 4 │
│ 6 │ G ││ 23 │ X ││ 40 │ o ││ 57 │ 5 │
│ 7 │ H ││ 24 │ Y ││ 41 │ p ││ 58 │ 6 │
│ 8 │ I ││ 25 │ Z ││ 42 │ q ││ 59 │ 7 │
│ 9 │ J ││ 26 │ a ││ 43 │ r ││ 60 │ 8 │
│ 10 │ K ││ 27 │ b ││ 44 │ s ││ 61 │ 9 │
│ 11 │ L ││ 28 │ c ││ 45 │ t ││ 62 │ + │
│ 12 │ M ││ 29 │ d ││ 46 │ u ││ 63 │ / │
│ 13 │ N ││ 30 │ e ││ 47 │ v ││ │ │
│ 14 │ O ││ 31 │ f ││ 48 │ w ││(pad) │ = │
│ 15 │ P ││ 32 │ g ││ 49 │ x ││ │ │
│ 16 │ Q ││ 33 │ h ││ 50 │ y ││ │ │
└──────┴──────────┴┴──────┴──────────┴┴──────┴──────────┴┴──────┴──────────┘
Is there a reason why base64 chose to do letters before numbers? Wouldn't it have made more sense for the value 0 to be represented by the encoding 0?