I've started to practice Rust.
I ran the program and got:
VecStorage { data: [1.0, 88.0, 87.0, 1.0, 70.0, 77.0, 1.0, 80.0, 79.0, 1.0, 82.0, 85.0, 1.0, 90.0, 97.0, 1.0, 100.0, 98.0], nrows: Dynamic { value: 6 }, ncols: Dynamic { value: 3 } }
VecStorage { data: [1.0, 1.0, 1.0, 88.0, 80.0, 90.0, 87.0, 79.0, 97.0, 1.0, 1.0, 1.0, 70.0, 82.0, 100.0, 77.0, 85.0, 98.0], nrows: Dynamic { value: 3 }, ncols: Dynamic { value: 6 } }
main.rs:
let matrix = vec![1.0,88.0,87.0,1.0,70.0,77.0,1.0,80.0,79.0,1.0,82.0,85.0,1.0,90.0,97.0,1.0,100.0,98.0];
let matrix = DMatrix::from_vec(6,3,matrix);
println!("{:?}",matrix);
println!("{:?}",matrix.transpose());
However, the transposed matrix is different from the transpose I expected. Any ideas why?
According to the documentation, the matrix is filled from the vector in column-major order. I suspect from the provided data that you're expecting a row-major fill (with the first column of the original matrix holding only 1s).
We can see this if we use Display formatting (println!("{}", matrix)) instead of Debug - the output in this case is the following:
┌ ┐
│ 1 1 1 │
│ 88 80 90 │
│ 87 79 97 │
│ 1 1 1 │
│ 70 82 100 │
│ 77 85 98 │
└ ┘
┌ ┐
│ 1 88 87 1 70 77 │
│ 1 80 79 1 82 85 │
│ 1 90 97 1 100 98 │
└ ┘
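If you want the data to be read row by row instead, nalgebra also provides DMatrix::from_row_slice(6, 3, &matrix), which fills the matrix in row-major order; alternatively, keep from_vec and list the data one column at a time.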
I would like to only include unique values in my polars Dataframe, based on one column.
In the example below I would like to create a new dataframe with only uniques based on the "col_float" column.
Before:
┌───────────┬──────────┬────────────┬────────────┐
│ col_float ┆ col_bool ┆ col_str    ┆ col_date   │
│ ---       ┆ ---      ┆ ---        ┆ ---        │
│ f64       ┆ bool     ┆ str        ┆ date       │
╞═══════════╪══════════╪════════════╪════════════╡
│ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 20.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
└───────────┴──────────┴────────────┴────────────┘
After:
┌───────────┬──────────┬────────────┬────────────┐
│ col_float ┆ col_bool ┆ col_str    ┆ col_date   │
│ ---       ┆ ---      ┆ ---        ┆ ---        │
│ f64       ┆ bool     ┆ str        ┆ date       │
╞═══════════╪══════════╪════════════╪════════════╡
│ 10.0      ┆ true     ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 20.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 40.0      ┆ false    ┆ 2020-01-01 ┆ 2020-01-01 │
└───────────┴──────────┴────────────┴────────────┘
(Notice the third row getting dropped because col_float was not unique)
Intuitively, one of my attempts was:
let mut df = pl.DataFrame(
{
"col_float": [10.0, 20.0, 20.0, 40.0],
"col_bool": [True, False, True, False],
"col_str": pl.repeat("2020-01-01", 4, eager=True),
};
let mut df2=DataFrame::new(vec![&df[0]]).unwrap();
df= df.unique(df2,UniqueKeepStrategy::First);
but got:
expected `Option<&[String]>`, found `DataFrame`
Which was to be expected, of course.
I'm not sure whether I'm using the right function and, if I am, how this subset should be passed. Searching the documentation or GitHub did not help me, as in the examples and code only None was passed as the subset.
This turned out to be less of a polars-related question and more about my experience with Rust.
Working example:
use polars::prelude::*;

// build the sample DataFrame (the df! macro returns a PolarsResult)
let df = df!(
    "col_float" => &[10.0, 20.0, 20.0, 40.0],
    "col_bool" => &[true, false, true, false],
    "col_str" => &["2020-01-01", "2020-01-01", "2020-01-01", "2020-01-01"]
).unwrap();

// unique() also returns a PolarsResult, hence the unwrap
let df = df
    .unique(Some(&["col_float".to_string()]), UniqueKeepStrategy::First)
    .unwrap();
I want to filter all duplicated rows from a polars dataframe. What I've tried:
df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
df
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
df.filter(pl.all().is_duplicated())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
This selects the first row, because it appears to go column-by-column and returns each row where all columns have a corresponding duplicate in the respective column - not the intended outcome.
Boolean indexing works:
df[df.is_duplicated(), :]
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
But it leaves me wondering
if this is indeed the only way to do it,
if there's a way to use .filter() and expressions to achieve the same result
if this is the most efficient way to achieve the desired result
In general, the is_duplicated method will likely perform best. Let's take a look at some alternative ways to accomplish this. And we'll do some (very) non-rigorous benchmarking - just to see which ones perform reasonably well.
Some alternatives
One alternative is a filter statement with an over (windowing) expression on all columns. One caution with windowed expressions - they are convenient, but can be costly performance-wise.
df.filter(pl.count("column_1").over(df.columns) > 1)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Another alternative is a groupby, followed by a join. Basically, we'll count the number of times that combinations of columns occur. I'm using a semi join here, simply because I don't want to include the count column in my final results.
df.join(
df=df.groupby(df.columns)
.agg(pl.count().alias("count"))
.filter(pl.col("count") > 1),
on=df.columns,
how="semi",
)
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Some (very) non-rigorous benchmarking
One way to see which alternatives perform reasonably well is to time the performance on a test dataset that might resemble the datasets that you will use. For lack of something better, I'll stick to something that looks close to the dataset in your question.
Set nbr_rows to something that will challenge your machine. (My machine is a 32-core system, so I'm going to choose a reasonably high number of rows.)
import polars as pl
import numpy as np
import string
nbr_rows = 100_000_000
df = pl.DataFrame(
{
"col1": np.random.choice(1_000, nbr_rows,),
"col2": np.random.choice(1_000, nbr_rows,),
"col3": np.random.choice(list(string.ascii_letters), nbr_rows,),
"col4": np.random.choice(1_000, nbr_rows,),
}
)
print(df)
shape: (100000000, 4)
┌──────┬──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╡
│ 955 ┆ 186 ┆ j ┆ 851 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 530 ┆ 199 ┆ d ┆ 376 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 109 ┆ 609 ┆ G ┆ 115 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 886 ┆ 487 ┆ d ┆ 479 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 837 ┆ 406 ┆ Y ┆ 60 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 467 ┆ 769 ┆ P ┆ 344 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 548 ┆ 372 ┆ F ┆ 410 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 379 ┆ 578 ┆ t ┆ 287 │
└──────┴──────┴──────┴──────┘
Now let's benchmark some alternatives. Since these may or may not resemble your datasets (or your computing platform), I won't run the benchmarks multiple times. For our purposes, we're just trying to weed out alternatives that might perform very poorly.
Alternative: is_duplicated
import time
start = time.perf_counter()
df[df.is_duplicated(),:]
end = time.perf_counter()
print(end - start)
>>> print(end - start)
7.834882180000932
Since the is_duplicated method is provided by the Polars API, we can be reasonably assured that it will perform very well. Indeed, this should be the standard against which we compare other alternatives.
Alternative: filter using an over (windowing) expression
start = time.perf_counter()
df.filter(pl.count("col1").over(df.columns) > 1)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
18.136289041000055
As expected, the over (windowing) expression is rather costly.
Alternative: groupby followed by a join
start = time.perf_counter()
df.join(
df=df.groupby(df.columns)
.agg(pl.count().alias("count"))
.filter(pl.col("count") > 1),
on=df.columns,
how="semi",
)
end = time.perf_counter()
print(end - start)
>>> print(end - start)
9.419006452999383
Somewhat better ... but not as good as using the is_duplicated method provided by the Polars API.
Alternative: concat_str
Let's also look at an alternative suggested in another answer. To be fair, @FBruzzesi did say "I am not sure this is optimal by any means". But let's look at how it performs.
start = time.perf_counter()
df.filter(pl.concat_str(df.columns, sep='|').is_duplicated())
end = time.perf_counter()
print(end - start)
>>> print(end - start)
37.238660977998734
Edit
Additional Alternative: filter and is_duplicated
We can also use filter with is_duplicated. Since df.is_duplicated() is not a column in the DataFrame when the filter is run, we'll need to wrap it in a polars.lit Expression.
start = time.perf_counter()
df.filter(pl.lit(df.is_duplicated()))
end = time.perf_counter()
print(end - start)
>>> print(end - start)
8.115436136999051
This performs just as well as using is_duplicated and boolean indexing.
Did this help? If nothing else, this shows some different ways to use the Polars API.
I think the optimal way really is:
df.filter(df.is_duplicated())
We have to be aware that Polars has is_duplicated() methods in both the expression API and the DataFrame API, but for the purpose of finding the duplicated rows we need to evaluate each column and reach a consensus in the end on whether the row is duplicated or not.
df.is_duplicated() will return a Series of boolean values.
It looks like this:
In []: df.is_duplicated()
Out[]:
shape: (4,)
Series: '' [bool]
[
false
true
false
true
]
Then leveraging our expression API I think we can do this:
In []: df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']])
...: df
Out[]:
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
│ 1 ┆ 7 ┆ 9 │
│ 1 ┆ 2 ┆ 3 │
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
In []: df.filter(df.is_duplicated())
Out[]:
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 9 │
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
And then for subsetting:
In []: df.filter(df.select(["column_0", "column_1"]).is_duplicated())
Out[]:
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 7 ┆ 3 │
│ 1 ┆ 7 ┆ 9 │
│ 1 ┆ 7 ┆ 9 │
└──────────┴──────────┴──────────┘
Any other way seems to me like trying to hack the API. I wish we had a way to "fold" the boolean columns we get from expressions like:
df.filter(pl.col("*").is_duplicated())
df.filter(pl.all().is_duplicated())
It is funny, because look at this:
In []: df.select(pl.all().is_duplicated())
Out[]:
shape: (4, 3)
┌──────────┬──────────┬──────────┐
│ column_0 ┆ column_1 ┆ column_2 │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞══════════╪══════════╪══════════╡
│ true ┆ true ┆ true │
│ true ┆ true ┆ true │
│ true ┆ false ┆ true │
│ true ┆ true ┆ true │
└──────────┴──────────┴──────────┘
This is what happens when we use is_duplicated() as an expression.
I am not sure this is optimal by any means, but you could concatenate all the columns into a single string per row and check that for duplicates, namely:
import polars as pl
df = pl.DataFrame([['1', '1', '1', '1'], ['7', '7', '2', '7'], ['3', '9', '3', '9']], columns=["a", "b", "c"])
df.filter(pl.concat_str(["a", "b", "c"]).is_duplicated())
shape: (2, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═════╪═════╡
│ 1 ┆ 7 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 7 ┆ 9 │
└─────┴─────┴─────┘
I'm working on a little utility app, coming from python/pandas and trying to rebuild some basic tools that can be distributed via executables. I'm having a hard time interpreting the documentation for what seems like it should be a fairly simple process of reading some raw data, resampling it based on the datetime column, and then interpolating it to fill in missing data as necessary.
My Cargo.toml looks like:
[dependencies]
polars = "0.19.0"
And the code I've written so far is:
use polars::prelude::*;
use std::fs::File;
fn main() {
let mut df = CsvReader::new("raw.csv".into())
.finish();
//interpolate to clean up blank/nan
//resample/groupby 15Min-1D using mean, blank/nan if missing
let mut file = File::create("final.csv").expect("File not written!!!");
CsvWriter::new(&mut file)
.has_header(true)
.with_delimiter(b',')
.finish(&df);
}
and the raw.csv data might look like:
site,datetime,val1,val2,val3,val4,val5,val6
XX1,2021-01-01 00:45,,,,4.60,,
XX1,2021-01-01 00:50,,,,2.30,,
XX1,2021-01-01 00:53,21.90,16.00,77.67,3.45,1027.20,0.00
XX1,2021-01-01 01:20,,,,4.60,,
XX1,2021-01-01 01:53,21.90,16.00,77.67,3.45,1026.90,0.00
XX1,2021-01-01 01:55,,,,0.00,,
XX1,2021-01-01 02:00,,,,0.00,,
XX1,2021-01-01 02:45,,,,5.75,,
XX1,2021-01-01 02:50,,,,8.05,,
XX1,2021-01-01 02:53,21.00,16.00,80.69,8.05,1026.80,0.00
But I can't seem to call the methods because I get errors like:
method not found in `Result<DataFrame, PolarsError>`
or
expected struct `DataFrame`, found enum `Result`
and I'm not sure how to properly shift between classes.
I've tried obviously wrong answers like:
let grouped = df.lazy().groupby_dynamic("datetime", "1h").agg("datetime", mean());
but basically, I'm looking for the polars equivalent of pandas code:
df = df.interpolate()
df = df.resample(sample_frequency).mean()
Any help would be appreciated!
Here is an example of how you might:
upsample via a left join on a date range
fill missing values with interpolate
downsample via a groupby_dynamic
use chrono::prelude::*;
use polars::prelude::*;
use polars_core::time::*;
use std::io::Cursor;
use polars::frame::groupby::DynamicGroupOptions;
fn main() -> Result<()> {
let csv = "site,datetime,val1,val2,val3,val4,val5,val6
XX1,2021-01-01 00:45,,,,4.60,,
XX1,2021-01-01 00:50,,,,2.30,,
XX1,2021-01-01 00:53,21.90,16.00,77.67,3.45,1027.20,0.00
XX1,2021-01-01 01:20,,,,4.60,,
XX1,2021-01-01 01:53,21.90,16.00,77.67,3.45,1026.90,0.00
XX1,2021-01-01 01:55,,,,0.00,,
XX1,2021-01-01 02:00,,,,0.00,,
XX1,2021-01-01 02:45,,,,5.75,,
XX1,2021-01-01 02:50,,,,8.05,,
XX1,2021-01-01 02:53,21.00,16.00,80.69,8.05,1026.80,0.00
";
let cursor = Cursor::new(csv);
// prefer scan csv when your data is not in memory
let mut df = CsvReader::new(cursor).finish()?;
df.try_apply("datetime", |s| {
s.utf8()?
.as_datetime(Some("%Y-%m-%d %H:%M"), TimeUnit::Nanoseconds)
.map(|ca| ca.into_series())
})?;
// now we take the datetime column and extract timestamps from them
// with these timestamps we create a `date_range` with an interval of 1 minute
let dt = df.column("datetime")?;
let timestamp = dt.cast(&DataType::Int64)?;
let timestamp_ca = timestamp.i64()?;
let first = timestamp_ca.get(0).unwrap();
let last = timestamp_ca.get(timestamp_ca.len() - 1).unwrap();
let range = date_range(
first,
last,
Duration::parse("1m"),
ClosedWindow::Both,
"date_range",
TimeUnit::Nanoseconds,
);
let range_df = DataFrame::new(vec![range.into_series()])?;
// now that we got the date_range we use it to upsample the dataframe.
// after that we interpolate the missing values
// and then we groupby in a fixed time interval to get more regular output
let out = range_df
.lazy()
.join(
df.lazy(),
[col("date_range")],
[col("datetime")],
JoinType::Left,
)
.select([col("*").interpolate()])
.groupby_dynamic([], DynamicGroupOptions {
index_column: "date_range".into(),
every: Duration::parse("15m"),
period: Duration::parse("15m"),
offset: Duration::parse("0m"),
truncate: true,
include_boundaries: false,
closed_window: ClosedWindow::Left,
}).agg([col("*").first()])
.collect()?;
dbg!(out);
Ok(())
}
These are the features I used:
["csv-file", "pretty_fmt", "temporal", "dtype-date", "dtype-datetime", "lazy", "interpolate", "dynamic_groupby"]
Output
This outputs:
[src/main.rs:68] out = shape: (9, 9)
┌─────────────────────┬─────────────────────┬────────────┬────────────┬─────┬────────────┬────────────┬─────────────┬────────────┐
│ date_range ┆ date_range_first ┆ site_first ┆ val1_first ┆ ... ┆ val3_first ┆ val4_first ┆ val5_first ┆ val6_first │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ datetime[ns] ┆ str ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════════════════╪═════════════════════╪════════════╪════════════╪═════╪════════════╪════════════╪═════════════╪════════════╡
│ 2021-01-01 00:45:00 ┆ 2021-01-01 00:45:00 ┆ XX1 ┆ null ┆ ... ┆ null ┆ 4.6 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 01:00:00 ┆ 2021-01-01 01:00:00 ┆ null ┆ 21.9 ┆ ... ┆ 77.67 ┆ 4.6 ┆ 1027.165 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 01:15:00 ┆ 2021-01-01 01:15:00 ┆ null ┆ 21.9 ┆ ... ┆ 77.67 ┆ 4.6 ┆ 1027.09 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 01:30:00 ┆ 2021-01-01 01:30:00 ┆ null ┆ 21.9 ┆ ... ┆ 77.67 ┆ 4.251515 ┆ 1027.015 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:00:00 ┆ 2021-01-01 02:00:00 ┆ XX1 ┆ 21.795 ┆ ... ┆ 78.022333 ┆ 0.0 ┆ 1027.153333 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:15:00 ┆ 2021-01-01 02:15:00 ┆ null ┆ 21.57 ┆ ... ┆ 78.777333 ┆ 4.983333 ┆ 1027.053333 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:30:00 ┆ 2021-01-01 02:30:00 ┆ null ┆ 21.345 ┆ ... ┆ 79.532333 ┆ 5.366667 ┆ 1026.953333 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-01-01 02:45:00 ┆ 2021-01-01 02:45:00 ┆ XX1 ┆ 21.12 ┆ ... ┆ 80.287333 ┆ 5.75 ┆ 1026.853333 ┆ 0.0 │
└─────────────────────┴─────────────────────┴────────────┴────────────┴─────┴────────────┴────────────┴─────────────┴────────────┘
Note that polars_core is also needed for the time module; this will be exported to polars in the next patch.
Try something like this:
use polars::prelude::*;
use std::fs::File;
fn main() {
let mut df = CsvReader::new("raw.csv".into())
.finish()
.unwrap();
//interpolate to clean up blank/nan
//resample/groupby 15Min-1D using mean, blank/nan if missing
let mut file = File::create("final.csv").expect("File not written!!!");
CsvWriter::new(&mut file)
.has_header(true)
.with_delimiter(b',')
.finish(&df)
.unwrap();
}
In general, if you get an error about Result, try adding .unwrap() or .expect(...).
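The longer answer above shows the other common pattern: give main a Result return type (fn main() -> Result<()>) and use the ? operator instead of unwrapping at every step.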
I have a dataframe in which one of the columns is of type interval[float64] and I wish to filter that dataframe to retrieve one of the bins:
distance_bin object:s in bin percentage Date
0 (-0.001, 1.0] 1092 46.00 2021-06-20
1 (1.0, 2.0] 533 22.45 2021-06-20
2 (2.0, 3.0] 257 10.83 2021-06-20
3 (3.0, 4.0] 46 1.94 2021-06-20
4 (4.0, 5.0] 144 6.07 2021-06-20
5 (5.0, 6.0] 117 4.93 2021-06-20
6 (6.0, 7.0] 58 2.44 2021-06-20
7 (7.0, 8.0] 46 1.94 2021-06-20
8 (8.0, 9.0] 48 2.02 2021-06-20
9 (9.0, 10.0] 16 0.67 2021-06-20
10 (10.0, 11.0] 6 0.25 2021-06-20
11 (11.0, 12.0] 8 0.34 2021-06-20
12 (12.0, 13.0] 1 0.04 2021-06-20
13 (15.0, 16.0] 1 0.04 2021-06-20
14 (21.0, 22.0] 1 0.04 2021-06-20
15 (-0.001, 1.0] 1080 45.55 2021-06-21
16 (1.0, 2.0] 546 23.03 2021-06-21
17 (2.0, 3.0] 245 10.33 2021-06-21
18 (3.0, 4.0] 54 2.28 2021-06-21
19 (4.0, 5.0] 150 6.33 2021-06-21
20 (5.0, 6.0] 104 4.39 2021-06-21
21 (6.0, 7.0] 44 1.86 2021-06-21
22 (7.0, 8.0] 51 2.15 2021-06-21
23 (8.0, 9.0] 38 1.60 2021-06-21
24 (9.0, 10.0] 36 1.52 2021-06-21
25 (10.0, 11.0] 7 0.30 2021-06-21
26 (11.0, 12.0] 15 0.63 2021-06-21
27 (12.0, 13.0] 1 0.04 2021-06-21
28 (-0.001, 1.0] 1094 46.24 2021-06-22
29 (1.0, 2.0] 517 21.85 2021-06-22
30 (2.0, 3.0] 289 12.21 2021-06-22
31 (3.0, 4.0] 42 1.78 2021-06-22
32 (4.0, 5.0] 139 5.87 2021-06-22
33 (5.0, 6.0] 98 4.14 2021-06-22
34 (6.0, 7.0] 43 1.82 2021-06-22
35 (7.0, 8.0] 47 1.99 2021-06-22
36 (8.0, 9.0] 46 1.94 2021-06-22
37 (9.0, 10.0] 30 1.27 2021-06-22
38 (10.0, 11.0] 6 0.25 2021-06-22
39 (11.0, 12.0] 15 0.63 2021-06-22
To give you a way to transform the intervals into the right datatype, here is the code I use:
import ast
import pandas as pd

def interval_type(s):
    """Parse an interval string such as '(5.0, 6.0]' into a pd.Interval."""
    table = str.maketrans({'[': '(', ']': ')'})
    left_closed = s.startswith('[')
    right_closed = s.endswith(']')
    left, right = ast.literal_eval(s.translate(table))
    t = 'neither'
    if left_closed and right_closed:
        t = 'both'
    elif left_closed:
        t = 'left'
    elif right_closed:
        t = 'right'
    return pd.Interval(left, right, closed=t)

df['distance_bin'] = df['distance_bin'].apply(interval_type)
Now, I thought that to filter df to only keep the interval (5.0, 6.0] it would be enough to do the following thing:
df_A = df_smh[df['distance_bin']=='(5.0, 6.0]']
But it returns an empty dataframe. I tried a multitude of other things, like:
df_A = df_smh[df.distance_bin.left>5 & df.distance_bin.right<=6]
but this returns:
AttributeError: 'Series' object has no attribute 'left'
Any guidance here would be greatly appreciated.
For me, comparing against a pd.Interval works:
df = pd.DataFrame({'distance_bin':[pd.Interval(-0.001, 1.0, closed='right'),
pd.Interval(1.0, 2.0, closed='right'),
pd.Interval(2.0, 3.0, closed='right'),
pd.Interval(3.0, 4.0, closed='right'),
pd.Interval(4.0, 5.0, closed='right')]})
df = df[df['distance_bin']==pd.Interval(1.0, 2.0, closed='right')]
print (df)
distance_bin
1 (1.0, 2.0]
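If you would rather filter by the interval bounds (as in your second attempt), one option is to pull the bounds out of the Interval objects first. A minimal sketch, assuming distance_bin already holds pd.Interval values (e.g. after applying your interval_type helper):
# extract the left/right bounds of each interval into plain float Series
bin_left = df['distance_bin'].apply(lambda iv: iv.left)
bin_right = df['distance_bin'].apply(lambda iv: iv.right)

# keep only the bin(s) lying within (5.0, 6.0]
df_A = df[(bin_left >= 5.0) & (bin_right <= 6.0)]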
Most numbering systems start with zero, go through the base-10 digits, and then go to letters once the base-10 digits have been exhausted:
Binary: 0,1
Octal: 0,1,2,3,4,5,6,7
Decimal: 0,1,2,3,4,5,6,7,8,9
Hexadecimal: 0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F
Even the ASCII order of characters has numbers come before letters.
The Base64 encoding scheme does things differently:
┌──────┬──────────┬┬──────┬──────────┬┬──────┬──────────┬┬──────┬──────────┐
│Value │ Encoding ││Value │ Encoding ││Value │ Encoding ││Value │ Encoding │
├──────┼──────────┼┼──────┼──────────┼┼──────┼──────────┼┼──────┼──────────┤
│ 0 │ A ││ 17 │ R ││ 34 │ i ││ 51 │ z │
│ 1 │ B ││ 18 │ S ││ 35 │ j ││ 52 │ 0 │
│ 2 │ C ││ 19 │ T ││ 36 │ k ││ 53 │ 1 │
│ 3 │ D ││ 20 │ U ││ 37 │ l ││ 54 │ 2 │
│ 4 │ E ││ 21 │ V ││ 38 │ m ││ 55 │ 3 │
│ 5 │ F ││ 22 │ W ││ 39 │ n ││ 56 │ 4 │
│ 6 │ G ││ 23 │ X ││ 40 │ o ││ 57 │ 5 │
│ 7 │ H ││ 24 │ Y ││ 41 │ p ││ 58 │ 6 │
│ 8 │ I ││ 25 │ Z ││ 42 │ q ││ 59 │ 7 │
│ 9 │ J ││ 26 │ a ││ 43 │ r ││ 60 │ 8 │
│ 10 │ K ││ 27 │ b ││ 44 │ s ││ 61 │ 9 │
│ 11 │ L ││ 28 │ c ││ 45 │ t ││ 62 │ + │
│ 12 │ M ││ 29 │ d ││ 46 │ u ││ 63 │ / │
│ 13 │ N ││ 30 │ e ││ 47 │ v ││ │ │
│ 14 │ O ││ 31 │ f ││ 48 │ w ││(pad) │ = │
│ 15 │ P ││ 32 │ g ││ 49 │ x ││ │ │
│ 16 │ Q ││ 33 │ h ││ 50 │ y ││ │ │
└──────┴──────────┴┴──────┴──────────┴┴──────┴──────────┴┴──────┴──────────┘
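For example, Python's standard base64 module makes the mapping easy to check: an all-zero input encodes to A characters (value 0), not 0 characters, and an all-ones input maps to the last alphabet entry, /:
import base64

# three zero bytes = 24 zero bits = four 6-bit groups with value 0 -> 'A'
print(base64.b64encode(b'\x00\x00\x00'))  # b'AAAA'

# three 0xff bytes = four 6-bit groups with value 63 -> '/'
print(base64.b64encode(b'\xff\xff\xff'))  # b'////'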
Is there a reason why base64 chose to do letters before numbers? Wouldn't it have made more sense for the value 0 to be represented by the encoding 0?