Fastest, most efficient way to aggregate a large dataset in python

Let's say I'm measuring the speed over time of a car moving forward along a single axis, with a new measurement every 10 minutes.
I have a column in my DataFrame called delta_x, which contains how far the car moved along that axis in the last 10 minutes; values are integers only.
Now let's say I want to aggregate my data and keep only the total movement for each hour. My dataset is extremely large, so I want to optimize the code as much as possible. What is the most efficient way to achieve that?
df.head(9)
date time delta_x
0 01/01/2018 00:00 9
1 01/01/2018 00:10 9
2 01/01/2018 00:20 9
3 01/01/2018 00:30 9
4 01/01/2018 00:40 11
5 01/01/2018 00:50 12
6 01/01/2018 01:00 10
7 01/01/2018 01:10 10
8 01/01/2018 01:20 10
Currently my solution is to do the following:
import os
from datetime import datetime
import pandas as pd

for file in os.listdir('temp'):
    if file.endswith('.txt'):
        # sep=r'\s+' splits columns on any run of whitespace
        df = pd.read_csv(os.path.join('temp', file), header=None, sep=r'\s+')
        df.columns = ['date', 'time', 'delta_x']
        df['hour'] = [datetime.strptime(x, "%H:%M").hour for x in df['time'].values]
        df = df.groupby(['date', 'hour']).agg({'delta_x': 'sum'})
Which outputs the correct:
date hour delta_x
01/01/2018 0 59
But I was wondering, is there a better, faster and more efficient way, perhaps using NumPy ?

You can try the following packages, which are designed to speed up pandas operations:
https://github.com/jmcarpenter2/swifter
https://github.com/modin-project/modin
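
Before reaching for an extra package, it may be worth removing the Python-level loop that builds the hour column, since the groupby itself is already vectorized. The following is only a minimal sketch, assuming the time column always has the zero-padded HH:MM form shown in the question; the inline sample data and the name hourly are illustrative, not part of the original code.

```python
import io

import pandas as pd

# Inline stand-in for one whitespace-delimited .txt file (sample rows from the question).
raw = io.StringIO(
    "01/01/2018 00:00 9\n"
    "01/01/2018 00:10 9\n"
    "01/01/2018 00:20 9\n"
    "01/01/2018 00:30 9\n"
    "01/01/2018 00:40 11\n"
    "01/01/2018 00:50 12\n"
    "01/01/2018 01:00 10\n"
    "01/01/2018 01:10 10\n"
    "01/01/2018 01:20 10\n"
)

df = pd.read_csv(raw, header=None, sep=r"\s+", names=["date", "time", "delta_x"])

# Take the hour straight from the "HH:MM" string instead of calling strptime once per row.
df["hour"] = df["time"].str[:2].astype("int8")

# Sum the 10-minute movements within each (date, hour) bucket.
hourly = df.groupby(["date", "hour"], sort=False)["delta_x"].sum().reset_index()
print(hourly)
```

Whether this is fast enough for a given dataset is something to measure; the packages above remain an option if it is not.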

Related

Computing 10d rolling average on a descending date column in pandas [duplicate]

Suppose I have a time series:
In [138]: rng = pd.date_range('1/10/2011', periods=10, freq='D')
In [139]: ts = pd.Series(range(len(rng)), index=rng)
In [140]: ts
Out[140]:
2011-01-10 0
2011-01-11 1
2011-01-12 2
2011-01-13 3
2011-01-14 4
2011-01-15 5
2011-01-16 6
2011-01-17 7
2011-01-18 8
2011-01-19 9
Freq: D, dtype: int64
If I use one of the rolling_* functions, for instance rolling_sum, I can get the behavior I want for backward looking rolling calculations:
In [157]: pd.rolling_sum(ts, window=3, min_periods=0)
Out[157]:
2011-01-10 0
2011-01-11 1
2011-01-12 3
2011-01-13 6
2011-01-14 9
2011-01-15 12
2011-01-16 15
2011-01-17 18
2011-01-18 21
2011-01-19 24
Freq: D, dtype: float64
But what if I want to do a forward-looking sum? I've tried something like this:
In [161]: pd.rolling_sum(ts.shift(-2, freq='D'), window=3, min_periods=0)
Out[161]:
2011-01-08 0
2011-01-09 1
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
Freq: D, dtype: float64
But that's not exactly the behavior I want. What I am looking for as an output is:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
That is, I want the sum of the "current" day plus the next two days. My current solution is not sufficient because I care about what happens at the edges. I know I could solve this manually by setting up two additional columns, shifted by 1 and 2 days respectively, and then summing the three columns, but there has to be a more elegant solution.

Why not just do it on the reversed Series (and reverse the answer):
In [11]: pd.rolling_sum(ts[::-1], window=3, min_periods=0)[::-1]
Out[11]:
2011-01-10 3
2011-01-11 6
2011-01-12 9
2011-01-13 12
2011-01-14 15
2011-01-15 18
2011-01-16 21
2011-01-17 24
2011-01-18 17
2011-01-19 9
Freq: D, dtype: float64
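
In current pandas versions the module-level pd.rolling_sum is no longer available, but the same reversal trick can be written with the Series.rolling method. This is a minimal sketch, assuming the integer-valued ts from the question and min_periods=1 so that the shortened windows at the end of the series are still summed; the name forward_sum is illustrative.

```python
import pandas as pd

rng = pd.date_range("1/10/2011", periods=10, freq="D")
ts = pd.Series(range(len(rng)), index=rng)

# Reverse, take a backward-looking 3-row sum, then reverse back:
# each entry ends up as the current day plus the next two days.
forward_sum = ts[::-1].rolling(window=3, min_periods=1).sum()[::-1]
print(forward_sum)
```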

I struggled with this then found an easy way using shift.
If you want a rolling sum for the next 10 periods, try:
df['NewCol'] = df['OtherCol'].shift(-10).rolling(10, min_periods = 0).sum()
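shift(-10) moves each value of "OtherCol" up by 10 rows, so a backward-looking rolling sum over 10 rows of the shifted column covers the 10 rows that follow the current row in the unshifted column (the current row itself is not included). Note that the first 9 rows only see a partial window, because the shifted column has no rows before the start of the frame.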

pandas 1.1.0 added a feature that lets you implement forward-looking rolling windows directly. You need pandas 1.1.0 or later to use it.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
ts.rolling(window=indexer, min_periods=1).sum()
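
For completeness, here is a self-contained sketch of the indexer approach applied to the series from the question. It assumes pandas 1.1.0 or later; the name forward_sum_idx is illustrative.

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

rng = pd.date_range("1/10/2011", periods=10, freq="D")
ts = pd.Series(range(len(rng)), index=rng)

# Each window starts at the current row and extends forward over 3 rows.
indexer = FixedForwardWindowIndexer(window_size=3)

# min_periods=1 keeps the truncated windows at the end of the series.
forward_sum_idx = ts.rolling(window=indexer, min_periods=1).sum()
print(forward_sum_idx)
```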

Maybe you can try the bottleneck module. When ts is large, bottleneck can be considerably faster than pandas.
import bottleneck as bn
result = bn.move_sum(ts.to_numpy()[::-1], window=3, min_count=1)[::-1]
Note that result is a NumPy array, not a Series. bottleneck also has other moving-window functions, such as move_max, move_argmin and move_rank.

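Try this for a forward-looking window of 3 (the last window - 1 entries come out as NaN, because their windows run past the end of the series):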
window = 3
ts.rolling(window).sum().shift(-window + 1)

Pandas : Finding correct time window

I have a pandas DataFrame that is updated every hour with the latest hourly data. I have to filter out IDs based on a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30 for 3 consecutive hours within a lookback period of 5 hours. I was using the statements below to accomplish this:
df_thld = df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5)) & (df.PR_Rate > 50) & (df.CNT_12571 < 30)].copy()
df_thld['HR_CNT'] = df_thld.groupby('ID')['Date'].transform('nunique')
df_thld[df_thld['HR_CNT'] >= 3]
The problem with this approach is that, because the lookback period is 5 hours, HR_CNT can also count non-consecutive hours that breach the criteria.
My dataset is as below:
Date IDs CT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above DataFrame, the threshold was not breached at 11:00, but my count picks 10:00, 12:00 and 13:00 as the breaching hours instead of 12:00, 13:00 and 14:00 as required. Each ID may or may not breach these criteria on a given day. Any idea how I can fix this?

Please excuse me if I have misinterpreted your problem. As I understand it, you have a DataFrame that is updated hourly; an example is shown below as df. From this DataFrame, you want to keep only the rows that satisfy the following two conditions:
PR_Rate > 50 and CNT_12571 < 30
The thresholds are breached for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this DataFrame, the only time frame that satisfies the above conditions is the set of entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    while r >= 0:
        end = r
        start = None
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r - i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
Running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
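
An alternative that avoids the explicit while loop is to label runs of consecutive breaching hours per ID and keep only the runs that are long enough. This is a hedged sketch rather than the answerer's method: it uses the column names and sample rows from the question's table (Date, IDs, CT_12571, PR_Rate), assumes readings fall exactly on the hour, and the function name consecutive_breaches and its parameters are hypothetical.

```python
import pandas as pd

# Sample rows from the question's table.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2021-06-16 10:00", "2021-06-16 11:00", "2021-06-16 12:00",
             "2021-06-16 13:00", "2021-06-16 14:00"]
        ),
        "IDs": ["A1"] * 5,
        "CT_12571": [15, 31, 25, 13, 7],
        "PR_Rate": [50.487, 40.806, 52.302, 61.45, 73.805],
    }
)


def consecutive_breaches(df, dur=3, lookback_hours=5, pr_threshold=50, ct_threshold=30):
    # Restrict to the lookback window, then to rows that breach both thresholds.
    recent = df[df["Date"] > df["Date"].max() - pd.Timedelta(hours=lookback_hours)]
    hit = recent[(recent["PR_Rate"] > pr_threshold) & (recent["CT_12571"] < ct_threshold)]
    hit = hit.sort_values(["IDs", "Date"])

    # A new run starts at the first row of each ID (gap is NaT) or whenever
    # the gap to the previous breaching row is not exactly one hour.
    gap = hit.groupby("IDs")["Date"].diff()
    run_id = (gap != pd.Timedelta(hours=1)).cumsum().rename("run")

    # Keep only rows that belong to a run of at least `dur` consecutive hours.
    run_len = hit.groupby(run_id)["Date"].transform("count")
    return hit[run_len >= dur]


result = consecutive_breaches(df)
print(result)
```

On the sample data, the 10:00 row is dropped because it forms a run of length one, which is the behaviour the question asks for.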

How to group by an Attribute and calculate time between consecutive tickets for that Attribute

So, I am working with a Dataframe where there are around 20 columns, but only two columns are really of importance.
Index  ID        Date
1      01-40-50  2021-12-01 16:54:00
2      01-10     2021-10-11 13:28:00
3      03-48-58  2021-11-05 16:54:00
4      01-40-50  2021-12-06 19:34:00
5      03-48-58  2021-12-09 12:14:00
6      01-10     2021-08-06 19:34:00
7      03-48-58  2021-10-01 11:44:00
There are 90 different IDs and a few thousand rows in total. What I want to do is:
Group the entries by ID
Order each ID's rows by Date
Calculate the difference between each timestamp and the previous one
Create a column that holds those differences (to then visualize them for the 90 different IDs)
While I thought it would be an easy thing to use the function groupby, I am having quite a bit of trouble. Would appreciate any input as to how to start this! Thank you!

You can do it this way:
>>> df.groupby("ID")["Date"].apply(lambda x: x.sort_values().diff())
ID Index
01-10 6 NaT
2 65 days 17:54:00
01-40-50 1 NaT
4 5 days 02:40:00
03-48-58 7 NaT
3 35 days 05:10:00
5 33 days 19:20:00
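
If the goal is a new column on the original frame, sorting first and then calling diff on the grouped Date column may be simpler than apply. A minimal sketch, assuming Date is already a datetime column; the sample frame is rebuilt from the question's table and the column name time_since_previous is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": ["01-40-50", "01-10", "03-48-58", "01-40-50",
               "03-48-58", "01-10", "03-48-58"],
        "Date": pd.to_datetime(
            ["2021-12-01 16:54:00", "2021-10-11 13:28:00", "2021-11-05 16:54:00",
             "2021-12-06 19:34:00", "2021-12-09 12:14:00", "2021-08-06 19:34:00",
             "2021-10-01 11:44:00"]
        ),
    },
    index=pd.Index(range(1, 8), name="Index"),
)

# Order rows by ID and then by Date so each ticket follows its predecessor.
df = df.sort_values(["ID", "Date"])

# Difference to the previous ticket of the same ID; the first ticket per ID gets NaT.
df["time_since_previous"] = df.groupby("ID")["Date"].diff()
print(df)
```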

can you re-sample a series without dates?

I have a monthly time series indexed by month number, from 1 to 420 (35 years). I would like to convert it to an annual series using the average of the 12 months in each year, so that I can put it in a DataFrame I have with annual data points. I currently have it set up using a range with steps of 12, but it gets kind of messy. Ideally I would use the resample function, but I'm having trouble since there are no dates. Is there any way around this?

There's no need to resample in this case. Just use groupby with integer division to obtain the average over the years.
import numpy as np
import pandas as pd
# Sample Data
np.random.seed(123)
df = pd.DataFrame({'Months': np.arange(1,421,1),
'val': np.random.randint(1,10,420)})
# Create Yearly average. 1-12, 13-24, Subtract 1 before // to get this grouping
df.groupby((df.Months-1)//12).val.mean().reset_index().rename(columns={'Months': 'Year'})
Outputs:
Year val
0 0 3.083333
1 1 4.166667
2 2 5.250000
3 3 4.416667
4 4 5.500000
5 5 4.583333
...
31 31 5.333333
32 32 5.000000
33 33 6.250000
34 34 5.250000
Feel free to add 1 to the Year column, or do whatever you need to make it consistent with the indexing in your other annual df. Alternatively, you could use df.groupby((df.Months+11)//12).val.mean() to get the Year to start at 1.
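
When the series is already sorted by month, has no gaps, and its length is an exact multiple of 12, the same averages can also be obtained by reshaping the values in NumPy. This is a sketch under those assumptions, using the sample data from above; the names annual and annual_df are illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({"Months": np.arange(1, 421),
                   "val": np.random.randint(1, 10, 420)})

# Reshape 420 monthly values into 35 rows of 12 months, then average each row.
annual = df["val"].to_numpy().reshape(-1, 12).mean(axis=1)
annual_df = pd.DataFrame({"Year": np.arange(1, len(annual) + 1), "val": annual})

# Cross-check against the groupby approach.
grouped = df.groupby((df.Months - 1) // 12).val.mean()
print(np.allclose(annual, grouped.to_numpy()))
```

The groupby version is more forgiving: it still works if months are missing or out of order, whereas the reshape relies on the layout being exact.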

Excel: How to make a formula differentiate between different vehicle plates

I have little knowledge of Excel and I'm trying to set up a table so I can get the fuel consumption of each vehicle in a company. All the data is entered in a single table, so how can I calculate the increase in km for each vehicle in order to then calculate its consumption?
The problem is that I don't know how to make the formula differentiate between the different plates.
The table is the following:
     A          B          C        D     E        F               G
1    Date       Plate      km       Gas   Signed   Increased km    Consum
2    1/1/2018   0157-AAA   123456   50    YES
3    5/1/2018   0157-AAA   123789   20    NO
4    8/2/2018   0157-AAA   123987   30    NO
5    1/2/2018   0582-BBB   123456   40    YES
6    1/3/2018   0356-CCC   123456   30    NO
Another example:
Date        Plate      km    Gas   Increased km   Consum %
3/5/2017    1111-AAA   150   20    150            13,33333333
7/5/2017    1111-AAA   400   30    250            12
7/5/2017    2222-BBB   50    10    50             20
7/5/2017    3333-CCC   20    5     20             25
10/5/2017   2222-BBB   200   30    150            20
Each plate is a different vehicle.
Gas is the amount of fuel the vehicle takes on at each refill, in litres.
The table is updated daily or every 2-3 days, as it is filled in manually.
The problem is calculating the increased km, as there may be other plates in between on the same date.
Consum % = Gas / Increased km * 100
I thought about just sorting the rows by date and by plate and applying one general formula to everything.
Thanks

I think I finally solved my problem. The formula I use relies on first sorting the table by plate, so that each vehicle's rows are adjacent. If the plate matches the row above, it subtracts the previous km reading; otherwise it takes the km value as is:
Increased km =IF(B2=B1;C2-C1;C2)
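
For readers working in pandas rather than Excel, the same per-plate logic can be sketched without sorting by plate first. This is an illustrative analogue only, assuming the rows are already in chronological order and using the second example table from the question; like the Excel formula, it treats the first reading of each plate as its increase.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": ["3/5/2017", "7/5/2017", "7/5/2017", "7/5/2017", "10/5/2017"],
        "Plate": ["1111-AAA", "1111-AAA", "2222-BBB", "3333-CCC", "2222-BBB"],
        "km": [150, 400, 50, 20, 200],
        "Gas": [20, 30, 10, 5, 30],
    }
)

# Previous km reading of the same plate; missing for a plate's first row.
prev_km = df.groupby("Plate")["km"].shift()

# Same rule as the IF formula: subtract the previous reading when there is one,
# otherwise keep the km value itself.
df["Increased km"] = df["km"] - prev_km.fillna(0)
df["Consum %"] = df["Gas"] / df["Increased km"] * 100
print(df)
```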
