Resample time series using Polars in Rust
I'm trying to learn Rust by doing some data parsing and reworking some of my trading tools, but I got stuck pretty quickly.
I want to resample my data from 5 min to 15 min, and Polars seems to be able to do this in an optimized way.
This is my attempt so far. I manage to group the times from 5 min into 15 min, but I cannot wrap my head around how to apply this grouping to the other columns.
use polars::prelude::*;
use std::error::Error;

fn type_of<T>(_: T) {
    println!("--- DATA TYPE: {}", std::any::type_name::<T>());
}

fn main() -> Result<(), Box<dyn Error>> {
    let path = "path/to/.csv";
    let df = LazyCsvReader::new(path)
        .has_header(true)
        .with_parse_dates(true)
        .finish()?
        .fetch(20)?;
    let tt = df.groupby_dynamic(
        vec![df.column("close")?.clone()],
        &DynamicGroupOptions {
            index_column: "time".into(),
            every: Duration::parse("15m"),
            period: Duration::parse("5m"),
            offset: Duration::parse("0s"),
            truncate: true,
            include_boundaries: false,
            closed_window: ClosedWindow::Left,
        },
    )?;
    //type_of(&tt);
    println!("{:?}", tt);
    Ok(())
}
OUTPUT
Series: 'time' [datetime[μs]]
[
2019-09-03 09:00:00
2019-09-03 09:15:00
2019-09-03 09:30:00
2019-09-03 09:45:00
2019-09-03 10:00:00
2019-09-03 10:15:00
2019-09-03 10:30:00
2019-09-03 10:45:00
2019-09-03 11:00:00
], [], Slice { groups: [[0, 1], [1, 1], [1, 4], [4, 4], [7, 4], [10, 4], [13, 4], [16, 4], [19, 1]], rolling: false })
As soon as I try to add a series I want to group to the "by" field (the first argument to groupby_dynamic), no resampling takes place; I only get back the same series that I put in.
The function outputs a Slice { groups: ... }, which is of type polars_core::frame::groupby::proxy::GroupsProxy, but I don't know how I should handle it.
My Cargo.toml:
[dependencies]
polars = { version = "0.25.1", features = ["lazy"] }
My .csv file:
time,open,high,low,close,volume
2019-09-03 09:00:00,1183.9999,1183.9999,1183.9999,1183.9999,150
2019-09-03 09:30:00,1178.69,1180.69,1178.47,1178.47,5180
2019-09-03 09:35:00,1177.03,1180.6146,1176.0,1179.47,70575
2019-09-03 09:40:00,1180.6345,1186.89,1180.6345,1185.5141,37267
2019-09-03 09:45:00,1185.9,1186.43,1182.43,1182.47,20569
2019-09-03 09:50:00,1183.54,1184.0,1180.0,1181.96,20754
2019-09-03 09:55:00,1182.5,1186.0,1182.49,1184.83,20848
2019-09-03 10:00:00,1185.5,1185.59,1184.03,1185.145,18581
2019-09-03 10:05:00,1184.65,1184.65,1175.5,1175.86,27714
2019-09-03 10:10:00,1175.49,1176.5,1173.65,1175.47,21779
2019-09-03 10:15:00,1175.295,1177.42,1173.5,1173.68,13588
2019-09-03 10:20:00,1173.01,1176.3717,1173.01,1175.44,9853
2019-09-03 10:25:00,1175.7896,1178.985,1175.7896,1177.468,7866
2019-09-03 10:30:00,1178.05,1179.0,1176.0038,1178.72,11576
2019-09-03 10:35:00,1179.005,1179.005,1176.53,1177.0077,9275
2019-09-03 10:40:00,1177.18,1178.02,1176.0201,1178.02,8852
2019-09-03 10:45:00,1178.3,1182.5,1178.3,1181.7113,14703
2019-09-03 10:50:00,1181.74,1181.9952,1180.01,1181.738,10225
2019-09-03 10:55:00,1182.11,1183.428,1181.33,1183.428,7835
2019-09-03 11:00:00,1183.41,1184.665,1183.41,1184.24,9078
The next thing would be to get the first, last, max, and min out of the grouped columns (yes, I'm trying to build OHLC candles).
Happy for any help!
Finally, I solved it!
I hadn't been using the lazy API correctly; I had somehow used some other implementation of groupby_dynamic, which I guess is meant for the inner workings of Polars or something. The main point was that I forgot .lazy().
Anyway, this is how I solved it:
use polars::prelude::*;
use std::error::Error;

fn type_of<T>(_: T) {
    println!("--- DATA TYPE: {}", std::any::type_name::<T>());
}

fn main() -> Result<(), Box<dyn Error>> {
    let path = "/home/niklas/projects/ML-trader/Datasets/Raw/GOOG.csv";
    let df = LazyCsvReader::new(path)
        .has_header(true)
        .with_parse_dates(true)
        .finish()?
        .collect()?;
    type_of(&df);
    // println!("{}", &df["close"]);
    let tt = df
        .lazy()
        .groupby_dynamic(
            vec![],
            DynamicGroupOptions {
                index_column: "time".into(),
                every: Duration::parse("15m"),
                period: Duration::parse("15m"),
                offset: Duration::parse("0s"),
                truncate: false,
                include_boundaries: false,
                closed_window: ClosedWindow::Left,
            },
        )
        .agg([
            col("close").first().alias("firstClose"),
            col("close").last().alias("lastClose"),
            col("close"),
        ])
        .fetch(20);
    println!("{:?}", tt);
    Ok(())
}
OUTPUT
--- DATA TYPE: &polars_core::frame::DataFrame
--- DATA TYPE: &core::result::Result<polars_core::frame::DataFrame, polars_core::error::PolarsError>
Ok(shape: (8, 4)
┌─────────────────────┬────────────┬───────────┬─────────────────────────────────┐
│ time ┆ firstClose ┆ lastClose ┆ close │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[μs] ┆ f64 ┆ f64 ┆ list[f64] │
╞═════════════════════╪════════════╪═══════════╪═════════════════════════════════╡
│ 2019-09-03 09:00:00 ┆ 1183.9999 ┆ 1183.9999 ┆ [1183.9999] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 09:30:00 ┆ 1178.47 ┆ 1185.5141 ┆ [1178.47, 1179.47, 1185.5141] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 09:45:00 ┆ 1182.47 ┆ 1184.83 ┆ [1182.47, 1181.96, 1184.83] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 10:00:00 ┆ 1185.145 ┆ 1175.47 ┆ [1185.145, 1175.86, 1175.47] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 10:15:00 ┆ 1173.68 ┆ 1177.468 ┆ [1173.68, 1175.44, 1177.468] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 10:30:00 ┆ 1178.72 ┆ 1178.02 ┆ [1178.72, 1177.0077, 1178.02] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 10:45:00 ┆ 1181.7113 ┆ 1183.428 ┆ [1181.7113, 1181.738, 1183.428] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2019-09-03 11:00:00 ┆ 1184.24 ┆ 1184.24 ┆ [1184.24] │
└─────────────────────┴────────────┴───────────┴─────────────────────────────────┘)
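A side note on the structure: since LazyCsvReader::finish() already returns a LazyFrame (which is why .fetch(20)? worked directly in my first attempt), the eager .collect() followed by .lazy() shouldn't be strictly necessary. I haven't profiled it, but I believe the query can be chained straight off the reader, something like this sketch:

// Untested sketch: keep everything lazy from the CSV reader onwards,
// so nothing is materialized until fetch/collect runs.
let tt = LazyCsvReader::new(path)
    .has_header(true)
    .with_parse_dates(true)
    .finish()?
    .groupby_dynamic(
        vec![],
        DynamicGroupOptions {
            index_column: "time".into(),
            every: Duration::parse("15m"),
            period: Duration::parse("15m"),
            offset: Duration::parse("0s"),
            truncate: false,
            include_boundaries: false,
            closed_window: ClosedWindow::Left,
        },
    )
    .agg([
        col("close").first().alias("firstClose"),
        col("close").last().alias("lastClose"),
        col("close"),
    ])
    .fetch(20);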
My Cargo.toml:
[dependencies]
polars = { version = "0.25.1", features = ["lazy", "dynamic_groupby"] }
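For completeness, since the real goal is OHLC candles: swapping out the aggregation expressions should be all that's needed. The following is only a sketch (the resample_ohlcv helper name is made up for illustration, and it assumes the same CSV columns open, high, low, close, volume and the same window options as above):

use polars::prelude::*;

// Hypothetical helper: resample 5-minute bars into 15-minute OHLCV candles.
// `lf` is any LazyFrame with the columns time, open, high, low, close, volume.
fn resample_ohlcv(lf: LazyFrame) -> LazyFrame {
    lf.groupby_dynamic(
        vec![],
        DynamicGroupOptions {
            index_column: "time".into(),
            every: Duration::parse("15m"),
            period: Duration::parse("15m"),
            offset: Duration::parse("0s"),
            truncate: false,
            include_boundaries: false,
            closed_window: ClosedWindow::Left,
        },
    )
    .agg([
        col("open").first().alias("open"),   // first open in each window
        col("high").max().alias("high"),     // highest high
        col("low").min().alias("low"),       // lowest low
        col("close").last().alias("close"),  // last close
        col("volume").sum().alias("volume"), // total traded volume
    ])
}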
Thank y'all for coming to my TED talk!
It seems that the solution to your problem doesn't have to mimic pandas drop_duplicates() function, but I'll provide one that mimics it and one that doesn't. If you need the exact same behavior as pandas drop_duplicates() then the following code is a way to go: #initialization of arr_dupes here #actual algorithm helper1, helper2 = np.unique(arr_dupes['date'][::-1], return_index = True) result = arr_dupes[::-1][helper2][::-1] When arr_dupes is initialized you need to pass only the 'date' column to numpy.unique(). Also since you are interested in the last of non-unique elements in an array you have to reverse the order of the array that you pass to unique() with [::-1]. This way unique() will throw out every non-unique element except last one. Then unique() returns a list of unique elements (helper1) as first return value and a list of indices of those elements in original array (helper2) as second return value. Lastly a new array is created by picking elements listed in helper2 from the original array arr_dupes. This solution is about 9.898 times faster than pandas version. Now let me explain what I meant in the beginning of this answer. It seems to me that your array is sorted by the 'date' column. If it is true then we can assume that duplicates are going to be grouped together. If they are grouped together then we only need to keep rows whose next rows 'date' column is different than the current rows 'date' column. So for example if we take a look at the following array rows: ... ('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246), ('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 246), ('2017-09-13T11:05:00.000000', 1.32685, 1.32704, 1.32682, 1.32686, 1.32684, 1.32702, 1.32679, 1.32683, 222), ('2017-09-13T11:04:00.000000', 1.32683, 1.32686, 1.32682, 1.32685, 1.32682, 1.32684, 1.3268 , 1.32684, 97), ... The third rows 'date' column is different than the fourths and we need to keep it. No need to do any more checks. First rows 'date' column is the same as the second rows and we don't need that row. Same goes for the second row. So in code it looks like this: #initialization of arr_dupes here #actual algorithm result = arr_dupes[np.concatenate((arr_dupes['date'][:-1] != arr_dupes['date'][1:], np.array([True])))] First every element of a 'date' column is compared with the next element. This creates an array of trues and falses. If an index in this boolean array has a true asigned to it then an arr_dupes element with that index needs to stay. Otherwise it needs to go. Next, concatenate() just adds one last true value to this boolean array since last element always needs to stay in the resulting array. This solution is about 17 times faster than pandas version.