Why would pandas groupby resampling be taking forever? - python-3.x

Pandas Version: 1.5.3
Python Version: 3.9.13
I'm trying to resample a pandas dataframe of some time series data which is divided by id. I've seen a million examples of this online and, frankly, have used this technique myself many times. However, for some reason, I have a dataframe that is taking an extremely long time to resample, despite having a rather reasonable number of rows (~250k).
The structure is very simple:
item_id  date        value
1        2023-01-01  1
1        2023-01-03  3
1        2023-01-05  5
2        2023-01-01  1
2        2023-01-03  3
2        2023-01-05  5
I've oversimplified, but this is the core idea. What I want to do is resample this, grouped by item_id, with a frequency of 'per day'. The resulting table should look like this after resampling...
item_id  date        value
1        2023-01-01  1
1        2023-01-02  NaN
1        2023-01-03  3
1        2023-01-04  NaN
1        2023-01-05  5
2        2023-01-01  1
2        2023-01-02  NaN
2        2023-01-03  3
2        2023-01-04  NaN
2        2023-01-05  5
Keep in mind, I want the resampled date range (the min/max) to be determined within each item_id group, so for the sake of this question I'm not looking for the stack/unstack method (although that method does execute almost instantly on this dataset).
Assuming df is my dataframe before resampling, this is the Python code I would use:
# First clean types / set index
df = df.astype({
    'item_id': 'str',
    'value': 'int'
})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')

# Resample
df = df.groupby('item_id').resample('D').value.mean()
If I run this code on ~250k rows, it takes approximately 12 minutes to execute. That's so long that I MUST assume I'm doing something wrong here... but for the life of me I cannot see it. Any suggestions?
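For reference, here is a self-contained sketch that reproduces my setup with synthetic data (the row count is real; the number of item_ids and the date spread are assumptions for illustration):
import numpy as np
import pandas as pd

# Synthetic data roughly matching the shape described above
# (assumption: ~250k rows spread over ~10k item_ids across one year)
rng = np.random.default_rng(0)
n = 250_000
df = pd.DataFrame({
    'item_id': rng.integers(0, 10_000, n).astype(str),
    'date': pd.to_datetime('2023-01-01')
            + pd.to_timedelta(rng.integers(0, 365, n), unit='D'),
    'value': rng.integers(0, 100, n),
})
df = df.set_index('date')

# The slow line: per-group daily resample
result = df.groupby('item_id').resample('D').value.mean()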

Related

Sum values for a specific date within multiple dictionaries each containing multiple dataframes in python

I have two dictionaries, each containing two dataframes of varying sizes, all of which share two columns in common: 'Date' and '# of Apples'.
I'm looking to create a dataframe 'results_df' that contains two columns 'Date' and 'Sum of Apples', which checks each of the four dataframes within the two dictionaries for a matching date and sums the '# of apples' for that day, placing it in the 'Sum of Apples' column of results_df.
Example of data:
dict1 = {'df1':Dataframe, 'df2':Dataframe}
df1 = ['Date', '# of Apples']
2023-01-01 ... 5
2023-01-03 ... 2
df2 = ['Date', '# of Apples']
2023-01-01 ... 1
2023-01-04 ... 4
dict2 = {'df3':Dataframe, 'df4':Dataframe}
df3 = ['Date', '# of Apples']
2023-01-03 ... 2
2023-01-04 ... 5
df4 = ['Date', '# of Apples']
2023-01-01 ... 4
2023-01-03 ... 3
Trying to achieve:
results_df = ['Date', 'Sum of Apples']
2023-01-01 ... 10
2023-01-02 ... 0
2023-01-03 ... 7
2023-01-04 ... 9
2023-01-05 ... 0
...
I'm unsure how to access the dataframes within the dicts and match dates in order to get the sum using dataframes.
I looked into df concatenation, but since the dataframes are of different sizes, I was having a lot of problems with that. I also tried merging the dicts into a single dict and looping over each dict and then each dataframe within it, but that ended up being pretty slow. I have a feeling that's not the correct way, and that there's a better way that fully utilizes the power of dataframes that I just don't understand.
Appreciate any help.
Edit: this is my first post here so if more information is needed, please let me know.
Based on your input/output, here is one way to do it:
results_df = (
    pd.concat([df_ for df_ in {**dict1, **dict2}.values()],
              ignore_index=True)
    .assign(Date=lambda df_: pd.to_datetime(df_["Date"]))
    .groupby("Date").sum()
    .set_axis(["Sum of Apples"], axis=1)
    .pipe(lambda df_: df_.reindex(pd.date_range(df_.index.min(),
                                                df_.index.max(),
                                                freq="D"),
                                  fill_value=0))
    .reset_index(names="Date")
)
NB: If you don't need the missing dates filled with zeros, remove the pandas.DataFrame.pipe call.
Output:
print(results_df, type(results_df))
        Date  Sum of Apples
0 2023-01-01             10
1 2023-01-02              0
2 2023-01-03              7
3 2023-01-04              9 <class 'pandas.core.frame.DataFrame'>
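For completeness, a minimal sketch that rebuilds the two dictionaries from the question's example data (column names and values taken from the question) so the pipeline above is runnable end to end:
import pandas as pd

# Reconstruct the question's sample data
df1 = pd.DataFrame({'Date': ['2023-01-01', '2023-01-03'], '# of Apples': [5, 2]})
df2 = pd.DataFrame({'Date': ['2023-01-01', '2023-01-04'], '# of Apples': [1, 4]})
df3 = pd.DataFrame({'Date': ['2023-01-03', '2023-01-04'], '# of Apples': [2, 5]})
df4 = pd.DataFrame({'Date': ['2023-01-01', '2023-01-03'], '# of Apples': [4, 3]})
dict1 = {'df1': df1, 'df2': df2}
dict2 = {'df3': df3, 'df4': df4}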

Split up time series per year for plotting

I would like to plot a time series (starting Oct-2015, ending Feb-2018) in one graph, with each year as a single line. The time series is an int64 value in a Pandas DataFrame; the date is a datetime64[ns] column in the same DataFrame.
How would I create a graph from Jan-Dec with 4 lines, one for each year?
graph['share_price'] and graph['date'] are used. I have tried Grouper, but that somehow takes the Oct-2015 values and mixes them with the January values from all the other years.
This groupby is close to what I want, but I lose the information about which year each list index belongs to:
graph.groupby('date').agg({'share_price': lambda x: list(x)})
I have also created a DataFrame with 4 columns, one for each year, but I still don't know how to group these columns so that I can plot the graph the way I want.
You can achieve this by:
extracting the year from the date
replacing the dates by the equivalent without the year
setting both the year and the date as index
unstacking the values by year
At this point, each year will be a column, and each date within the year a row, so you can just plot normally.
Here's an example.
Assuming that your DataFrame looks something like this:
>>> import pandas as pd
>>> import numpy as np
>>> index = pd.date_range('2015-10-01', '2018-02-28')
>>> values = np.random.randint(-3, 4, len(index)).cumsum()
>>> df = pd.DataFrame({
...     'date': index,
...     'share_price': values,
... })
>>> df.head()
date share_price
0 2015-10-01 0
1 2015-10-02 3
2 2015-10-03 2
3 2015-10-04 5
4 2015-10-05 4
>>> df.set_index('date').plot()
You would transform the DataFrame as follows:
>>> df['year'] = df.date.dt.year
>>> df['date'] = df.date.dt.strftime('%m-%d')
>>> unstacked = df.set_index(['year', 'date']).share_price.unstack(-2)
>>> unstacked.head()
year 2015 2016 2017 2018
date
01-01 NaN 28.0 -16.0 21.0
01-02 NaN 29.0 -14.0 22.0
01-03 NaN 29.0 -16.0 22.0
01-04 NaN 26.0 -15.0 23.0
01-05 NaN 25.0 -16.0 21.0
And just plot normally:
unstacked.plot()
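Equivalently (just an alternative sketch, not part of the original answer), the set_index/unstack pair can be replaced by a single pivot, since each (year, date) pair is unique for daily data:
# Same reshaping as above: rows are month-day strings, columns are years
unstacked = df.pivot(index='date', columns='year', values='share_price')
unstacked.plot()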

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the data-2 value depends on the corresponding value in df_lookup (keyed by hour). I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
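Another option (a sketch, not from either answer above): map each row's hour onto the lookup values directly with Series.map, which avoids the temporary hour/multiplier columns. This assumes df_lookup holds the weights in its first column.
import pandas as pd

weights = df_lookup.iloc[:, 0]                    # hour -> weight
hours = pd.Series(df.index.hour, index=df.index)  # hour of each timestamp
df['data-3'] = df['data-2'] * hours.map(weights)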

can you re-sample a series without dates?

I have a time series from months 1 to 420 (35 years). I would like to convert it to an annual series using the average of the 12 months in each year, so I can put it in a dataframe I have with annual datapoints. I have it set up using a range with steps of 12, but it gets kind of messy. Ideally I would like to use the resample function, but I'm having trouble since there are no dates. Any way around this?
There's no need to resample in this case. Just use groupby with integer division to obtain the average over the years.
import numpy as np
import pandas as pd

# Sample data
np.random.seed(123)
df = pd.DataFrame({'Months': np.arange(1, 421, 1),
                   'val': np.random.randint(1, 10, 420)})

# Create yearly average: months 1-12 -> year 0, 13-24 -> year 1, ...
# (subtract 1 before // to get this grouping)
df.groupby((df.Months - 1) // 12).val.mean().reset_index().rename(columns={'Months': 'Year'})
Outputs:
Year val
0 0 3.083333
1 1 4.166667
2 2 5.250000
3 3 4.416667
4 4 5.500000
5 5 4.583333
...
31 31 5.333333
32 32 5.000000
33 33 6.250000
34 34 5.250000
Feel free to add 1 to the year column or whatever you need to make it consistent with indexing in your other annual df. Otherwise, you could just use df.groupby((df.Months + 11) // 12).val.mean() to get the Year to start at 1.
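If you specifically want resample, a sketch is to attach a synthetic monthly DatetimeIndex first (the start date here is an arbitrary assumption; only the 12-month grouping matters):
import numpy as np
import pandas as pd

np.random.seed(123)
s = pd.Series(np.random.randint(1, 10, 420),
              index=pd.date_range('2000-01-01', periods=420, freq='MS'))

# 'AS' = year-start frequency: one mean per 12-month block
annual = s.resample('AS').mean()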

Resampling with a Multiindex and several columns

I have a pandas dataframe with the following structure:
ID  date        m_1   m_2
1   2016-01-03  10    3.4
    2016-02-07  11    3.3
    2016-02-07  10.4  2.8
2   2016-01-01  10.9  2.5
    2016-02-04  12    2.3
    2016-02-04  11    2.7
    2016-02-04  12.1  2.1
Both ID and date are a MultiIndex. The data represent some measurements made by some sensors (in the example two sensors). Those sensors sometimes create several measurements per day (as shown in the example).
My questions are:
How can I resample this so that I have one row per day per sensor, but one column with the mean, another with the max, another with the min, etc.?
How can I "align" (maybe this is not the correct word) the two time series, so that both begin and end at the same time (from 2016-01-01 to 2016-02-07), adding the missing days as NAs?
You can use groupby with DataFrameGroupBy.resample, aggregate with a dict of functions first, and then reindex by MultiIndex.from_product:
df = df.reset_index(level=0).groupby('ID').resample('D').agg({'m_1':'mean', 'm_2':'max'})
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names = df.index.names))
#alternative for adding missing start and end datetimes
#df = df.unstack().stack(dropna=False)
print (df.head())
m_2 m_1
ID date
1 2016-01-01 NaN NaN
2016-01-02 NaN NaN
2016-01-03 3.4 10.0
2016-01-04 NaN NaN
2016-01-05 NaN NaN
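As a side note (a sketch, not part of the original answer, assuming df is the original frame from the question before resampling): if you want several statistics per measurement column, as the question asks, you can pass a list of functions instead of a dict:
# One row per day per sensor; columns become a MultiIndex
# of (measurement, statistic), e.g. ('m_1', 'mean')
stats = (df.reset_index(level=0)
           .groupby('ID')
           .resample('D')
           .agg(['mean', 'max', 'min']))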
For a PeriodIndex in the second level, use set_levels with to_period:
df.index = df.index.set_levels(df.index.get_level_values('date').to_period('d'), level=1)
print (df.index.get_level_values('date'))
PeriodIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01',
'2016-02-02', '2016-02-03', '2016-02-04', '2016-02-05',
'2016-02-06', '2016-02-07', '2016-01-01', '2016-01-02',
'2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06',
'2016-01-07', '2016-01-08', '2016-01-09', '2016-01-10',
'2016-01-11', '2016-01-12', '2016-01-13', '2016-01-14',
'2016-01-15', '2016-01-16', '2016-01-17', '2016-01-18',
'2016-01-19', '2016-01-20', '2016-01-21', '2016-01-22',
'2016-01-23', '2016-01-24', '2016-01-25', '2016-01-26',
'2016-01-27', '2016-01-28', '2016-01-29', '2016-01-30',
'2016-01-31', '2016-02-01', '2016-02-02', '2016-02-03',
'2016-02-04', '2016-02-05', '2016-02-06', '2016-02-07'],
dtype='period[D]', name='date', freq='D')
