Resampling with a MultiIndex and several columns - python-3.x

I have a pandas dataframe with the following structure:
                m_1   m_2
ID date
1  2016-01-03  10.0   3.4
   2016-02-07  11.0   3.3
   2016-02-07  10.4   2.8
2  2016-01-01  10.9   2.5
   2016-02-04  12.0   2.3
   2016-02-04  11.0   2.7
   2016-02-04  12.1   2.1
ID and date together form a MultiIndex. The data represent measurements made by some sensors (two sensors in the example). Those sensors sometimes record several measurements per day (as shown in the example).
My questions are:
How can I resample this so I have one row per day per sensor, but one column with the mean, another with the max, another with the min, etc.?
How can I "align" (maybe this is not the correct word) the two time series, so both begin and end at the same time (from 2016-01-01 to 2016-02-07), adding the missing days with NAs?

You can use groupby with DataFrameGroupBy.resample, aggregate with a dict of functions, and then reindex by MultiIndex.from_product:
# resample each ID to daily frequency, aggregating each column separately
df = df.reset_index(level=0).groupby('ID').resample('D').agg({'m_1':'mean', 'm_2':'max'})
# extend every ID to the full date range (cartesian product of the index levels)
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names=df.index.names))
# alternative for adding missing start and end datetimes
# df = df.unstack().stack(dropna=False)
print (df.head())
                m_2   m_1
ID date
1  2016-01-01  NaN   NaN
   2016-01-02  NaN   NaN
   2016-01-03  3.4  10.0
   2016-01-04  NaN   NaN
   2016-01-05  NaN   NaN
For a PeriodIndex in the second level, use set_levels with to_period:
df.index = df.index.set_levels(df.index.get_level_values('date').to_period('d'), level=1)
print (df.index.get_level_values('date'))
PeriodIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
             '2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
             '2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
             '2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
             '2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
             '2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
             '2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
             '2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01',
             '2016-02-02', '2016-02-03', '2016-02-04', '2016-02-05',
             '2016-02-06', '2016-02-07', '2016-01-01', '2016-01-02',
             '2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06',
             '2016-01-07', '2016-01-08', '2016-01-09', '2016-01-10',
             '2016-01-11', '2016-01-12', '2016-01-13', '2016-01-14',
             '2016-01-15', '2016-01-16', '2016-01-17', '2016-01-18',
             '2016-01-19', '2016-01-20', '2016-01-21', '2016-01-22',
             '2016-01-23', '2016-01-24', '2016-01-25', '2016-01-26',
             '2016-01-27', '2016-01-28', '2016-01-29', '2016-01-30',
             '2016-01-31', '2016-02-01', '2016-02-02', '2016-02-03',
             '2016-02-04', '2016-02-05', '2016-02-06', '2016-02-07'],
            dtype='period[D]', name='date', freq='D')

Related

Why would pandas groupby resampling be taking forever?

Pandas Version: 1.5.3
Python Version: 3.9.13
I'm trying to resample a pandas dataframe of time-series data which is divided by id. I've seen a million examples of this online and, frankly, have used this technique myself many times. However, for some reason I have a dataframe that takes an extremely long time to resample, despite having a rather reasonable number of rows (~250k).
The structure is very simple:
item_id  date        value
1        2023-01-01  1
1        2023-01-03  3
1        2023-01-05  5
2        2023-01-01  1
2        2023-01-03  3
2        2023-01-05  5
I've oversimplified, but this is the core idea. What I want to do is resample this, grouped by item_id, with a frequency of 'per day'. The resulting table should look like this after resampling...
item_id  date        value
1        2023-01-01  1
1        2023-01-02  NaN
1        2023-01-03  3
1        2023-01-04  NaN
1        2023-01-05  5
2        2023-01-01  1
2        2023-01-02  NaN
2        2023-01-03  3
2        2023-01-04  NaN
2        2023-01-05  5
Keep in mind, I'm looking for the resampled min/max dates to be within each item_id group, so for the sake of this question I'm not looking for the stack/unstack method (however, that method does execute almost instantly on this dataset).
Assuming df is my dataframe before resampling, this would be the python code I would use...
# First clean types / set index
df = df.astype({
    'item_id': 'str',
    'value': 'int'
})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')
# Resample
df = df.groupby('item_id').resample('D').value.mean()
If I run this code on ~250k rows, it takes approximately 12 minutes to execute. That's so long that I MUST assume I'm doing something wrong here... but for the life of me I cannot see it. Any suggestions?
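For reference, the stack/unstack route the question rules out (but notes is near-instant) looks roughly like this sketch, assuming the original flat frame with item_id/date/value columns; note it fills the global date range rather than each group's own min/max:
wide = df.set_index(['date', 'item_id'])['value'].unstack('item_id')  # one column per id
wide = wide.asfreq('D')                         # insert the missing days once, for all ids
out = wide.stack(dropna=False).rename('value')  # back to one row per (date, item_id)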

Fill missing values in different columns of a dataframe using the mean or median of the last n values

I have a dataframe which contains time-series data. What I want to do is efficiently fill all the missing values in the different columns by substituting the median value within a timedelta of, say, N minutes. E.g. if a column has data for 10:20, 10:21, 10:22, 10:23, 10:24, ... and the value at 10:22 is missing, then with a timedelta of 2 minutes I would want it filled with the median of the values at 10:20, 10:21, 10:23 and 10:24.
One way I can do this is:
from datetime import timedelta

delta = timedelta(minutes=2)
for col in df.columns:
    # find the indices that have a NaN value
    for idx in df.index[df[col].isna()]:
        # extract all values between idx - delta and idx + delta (assuming a DatetimeIndex)
        window = df.loc[idx - delta : idx + delta, col]
        # set the value at idx to the median of the extracted values
        df.at[idx, col] = window.median()
This looks like two for loops running and is not very efficient. Is there an efficient way to do it?
Thanks
IIUC you can resample your time column, then fillna with a rolling window set to center (here n is the number of rows on each side, so the window spans n*2+1 one-minute rows):
# dummy data setup
import numpy as np
import pandas as pd

np.random.seed(500)
n = 2
df = pd.DataFrame({"time": pd.to_timedelta([f"10:{i}:00" for i in range(15)]),
                   "value": np.random.randint(2, 10, 15)})
df = df.drop(df.index[[5, 10]]).reset_index(drop=True)
print (df)
        time  value
0   10:00:00      4
1   10:01:00      9
2   10:02:00      3
3   10:03:00      3
4   10:04:00      8
5   10:06:00      9
6   10:07:00      2
7   10:08:00      9
8   10:09:00      9
9   10:11:00      7
10  10:12:00      3
11  10:13:00      3
12  10:14:00      7
s = df.set_index("time").resample("60S").asfreq()
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).mean()))
          value
time
10:00:00    4.0
10:01:00    9.0
10:02:00    3.0
10:03:00    3.0
10:04:00    8.0
10:05:00    5.5
10:06:00    9.0
10:07:00    2.0
10:08:00    9.0
10:09:00    9.0
10:10:00    7.0
10:11:00    7.0
10:12:00    3.0
10:13:00    3.0
10:14:00    7.0
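The question asks for the median rather than the mean; the same pattern applies unchanged (a sketch under the same setup):
print (s.fillna(s.rolling(n*2+1, min_periods=1, center=True).median()))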

Split up time series per year for plotting

I would like to plot a time series, starting Oct-2015 and ending Feb-2018, in one graph, with each year as a single line. The time series values are int64 and live in a Pandas DataFrame. The date is datetime64[ns] and is one of the columns of the DataFrame.
How would I create a graph from Jan-Dec with 4 lines, one for each year?
graph['share_price'] and graph['date'] are the columns used. I have tried Grouper, but that somehow takes the Oct-2015 values and mixes them with the January values from all the other years.
This groupby is close to what I want, but I lose the information about which year each entry of the list belongs to.
graph.groupby('date').agg({'share_price':lambda x: list(x)})
Then I created a DataFrame with 4 columns, one for each year, but I still don't know how to group these 4 columns in a way that lets me plot the graph I want.
You can achieve this by:
extracting the year from the date
replacing the dates by the equivalent without the year
setting both the year and the date as index
unstacking the values by year
At this point, each year will be a column, and each date within the year a row, so you can just plot normally.
Here's an example.
Assuming that your DataFrame looks something like this:
>>> import pandas as pd
>>> import numpy as np
>>> index = pd.date_range('2015-10-01', '2018-02-28')
>>> values = np.random.randint(-3, 4, len(index)).cumsum()
>>> df = pd.DataFrame({
...     'date': index,
...     'share_price': values
... })
>>> df.head()
        date  share_price
0 2015-10-01            0
1 2015-10-02            3
2 2015-10-03            2
3 2015-10-04            5
4 2015-10-05            4
>>> df.set_index('date').plot()
You would transform the DataFrame as follows:
>>> df['year'] = df.date.dt.year
>>> df['date'] = df.date.dt.strftime('%m-%d')
>>> unstacked = df.set_index(['year', 'date']).share_price.unstack(-2)
>>> unstacked.head()
year   2015  2016  2017  2018
date
01-01   NaN  28.0 -16.0  21.0
01-02   NaN  29.0 -14.0  22.0
01-03   NaN  29.0 -16.0  22.0
01-04   NaN  26.0 -15.0  23.0
01-05   NaN  25.0 -16.0  21.0
And just plot normally:
unstacked.plot()

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (the hour is also the index of this lookup dataframe). The dataframe looks like this:
In [1]: print (df_lookup)
Out[1]:
0     1.109248
1     1.102435
2     1.085014
3     1.073487
4     1.079385
5     1.088759
6     1.044708
7     0.902482
8     0.852348
9     0.995912
10    1.031643
11    1.023458
12    1.006961
...
23    0.889541
I want to multiply the values from this lookup dataframe into a column of another dataframe, which has a datetime index.
That dataframe looks like this:
In [2]: print (df)
Out[2]:
                     ID  data-1  data-2  data-3
Date_Label
2015-08-09 00:00:00   1  2513.0    2502     NaN
2015-08-09 00:00:00   1  2113.0    2102     NaN
2015-08-09 01:00:00   2  2006.0    1988     NaN
2015-08-09 02:00:00   3  2016.0    2003     NaN
...
2018-07-19 23:00:00  33  3216.0     333     NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the 'data-2' value depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc with the hour of the index:
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00    2775.338496
2015-08-09 00:00:00    2331.639296
2015-08-09 01:00:00    2191.640780
2015-08-09 02:00:00    2173.283042
Name: data-2, dtype: float64
#df['data-3'] = df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
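For what it's worth, the same lookup can also be written with Index.map, avoiding the temporary hour/multiplier columns (a sketch, assuming df_lookup carries the single 'multiplier' column named above):
df['data-3'] = df['data-2'] * df.index.hour.map(df_lookup['multiplier'])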

Generate daterange and insert in a new column of a dataframe

Problem statement: create a dataframe with multiple columns and populate one column with a daterange series at a 5 minute interval.
Tried solution:
I initially created a dataframe with just one row / 5 columns (all "NaN").
Command used to generate the daterange:
rf = pd.date_range('2000-1-1', periods=5, freq='5min')
O/P of rf:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 00:05:00',
               '2000-01-01 00:10:00', '2000-01-01 00:15:00',
               '2000-01-01 00:20:00'],
              dtype='datetime64[ns]', freq='5T')
When I try to assign rf to one of the columns of df (df['column1'] = rf), it throws the exception shown below (copying the last lines of the exception).
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2879, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
Though I understand the issue, I don't know the solution. I'm looking for an easy way to achieve this.
I think I'm slowly understanding the power/usage of dataframes.
Initially, create a dataframe:
df = pd.DataFrame(index=range(100), columns=['A','B','C'])
Then I created a date_range:
date = pd.date_range('2000-1-1', periods=100, freq='5T')
Using the "assign" function, I added the date_range as a new column to the already created dataframe (df):
df = df.assign(D=date)
Final O/P of df:
df[:5]
     A    B    C                   D
0  NaN  NaN  NaN 2000-01-01 00:00:00
1  NaN  NaN  NaN 2000-01-01 00:05:00
2  NaN  NaN  NaN 2000-01-01 00:10:00
3  NaN  NaN  NaN 2000-01-01 00:15:00
4  NaN  NaN  NaN 2000-01-01 00:20:00
Your dataframe has only one row, but you are trying to insert data for five rows; the lengths have to match.
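A minimal sketch of that mismatch and of one fix, assuming the one-row frame from the question (the column name is invented for illustration):
import pandas as pd

df = pd.DataFrame(index=range(1), columns=['column1'])   # one row, all NaN
rf = pd.date_range('2000-1-1', periods=5, freq='5min')   # five values
# df['column1'] = rf              # would raise: length of values != length of index
df = df.reindex(range(len(rf)))   # grow the index to five rows first
df['column1'] = rf                # lengths now agree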
