Calculating weighted sum over different time series using pd.concat - python-3.x

I have multiple time series that have different data densities and different lengths, and I want to calculate their weighted sum. For example, df1, df2 and df3:
Out[467]:
datetime_doy
2017-01-01 0.308632
2017-01-02 0.307647
2017-01-03 0.306493
2017-01-04 0.292955
2017-01-10 0.369009
2019-12-27 0.387553
2019-12-28 0.383481
2019-12-29 0.382838
2019-12-30 0.379383
2019-12-31 0.379172
Name: df1, Length: 1055, dtype: float64
datetime_doy
2017-01-01 0.310446
2017-01-02 0.309330
2017-01-03 0.308632
2017-01-04 0.306234
2017-01-10 0.317367
2019-12-27 0.387510
2019-12-28 0.383549
2019-12-29 0.382762
2019-12-30 0.379483
2019-12-31 0.379078
Name: df2, Length: 1042, dtype: float64
datetime_doy
2017-01-01 0.302718
2017-01-02 0.301939
2017-01-03 0.301440
2017-01-04 0.300281
2017-01-05 0.299731
2017-08-27 0.227604
2017-08-28 0.227431
2017-08-30 0.227167
2017-08-31 0.237400
2017-09-01 0.243424
Name: df3, Length: 227, dtype: float64
I know that if I want to calculate the mean, I can just use pd.concat([df1, df2, df3], axis=1).mean(axis=1), like:
pd.concat([df1, df2, df3],axis=1).mean(axis=1)
Out[475]:
datetime_doy
2017-01-01 0.307265
2017-01-02 0.306305
2017-01-03 0.305522
2017-01-04 0.299823
2017-01-05 0.299731
2019-12-27 0.387532
2019-12-28 0.383515
2019-12-29 0.382800
2019-12-30 0.379433
2019-12-31 0.379125
Length: 1065, dtype: float64
But what if I want to calculate the weighted average of df1, df2 and df3, with weights of, say, 0.1, 0.2 and 0.3? At time t, if only df1 and df2 have values, the new value is (0.1*df1.loc[t] + 0.2*df2.loc[t]) / (0.1 + 0.2). If df1, df2 and df3 all have values at time t, it is (0.1*df1.loc[t] + 0.2*df2.loc[t] + 0.3*df3.loc[t]) / (0.1 + 0.2 + 0.3). If none of the dataframes has a value, the result is simply np.nan (note that df3 only has data in 2017).
So how can I get it? Thanks!

I have found a solution to your problem by creating a separate pd.DataFrame for the weights. This way, you keep the sum of values for each day and the sum of weights for each day separate. I have created an example to illustrate my point:
import numpy as np
import pandas as pd

a = ["2022-12-01", "2022-12-02", "2022-12-03", "2022-12-04", "2022-12-05"]
b = ["2022-12-03", "2022-12-04", "2022-12-05", "2022-12-06", "2022-12-07"]
c = ["2022-12-05", "2022-12-06", "2022-12-07", "2022-12-08", "2022-12-09"]
WEIGHT1 = 0.1
WEIGHT2 = 0.2
WEIGHT3 = 0.3
df1 = pd.DataFrame(data = np.random.normal(size=5), index=a, columns=["a"])
df2 = pd.DataFrame(data = np.random.normal(size=5), index=b, columns=["b"])
df3 = pd.DataFrame(data = np.random.normal(size=5), index=c, columns=["c"])
I have defined the above dates for my dataframes and the weights following your example. As you pointed out in your question, some dates belong to all three dataframes, some to only two, and some are unique to a single df. I have also filled the dataframes with random values.
df1_weight = pd.DataFrame(data = WEIGHT1, index=df1.index, columns=["weight1"])
df2_weight = pd.DataFrame(data = WEIGHT2, index=df2.index, columns=["weight2"])
df3_weight = pd.DataFrame(data = WEIGHT3, index=df3.index, columns=["weight3"])
(pd.concat([df1 * WEIGHT1, df2 * WEIGHT2, df3 * WEIGHT3], axis=1)
   .sum(axis=1).rename("sum_values").to_frame()
   .join(pd.concat([df1_weight, df2_weight, df3_weight], axis=1)
           .sum(axis=1).rename("sum_weights")))
My proposed solution consists of creating three dataframes, one for each weight, and concatenating them as you did in the question. The last line concatenates all the values and all the weights and sums each of them per day; you then only need to divide the two resulting columns to obtain the desired values.
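For completeness, here is a sketch of that final division step under the same example setup (the variable name combined is mine):
combined = (pd.concat([df1 * WEIGHT1, df2 * WEIGHT2, df3 * WEIGHT3], axis=1)
              .sum(axis=1).rename("sum_values").to_frame()
              .join(pd.concat([df1_weight, df2_weight, df3_weight], axis=1)
                      .sum(axis=1).rename("sum_weights")))
# Divide the summed weighted values by the summed weights that contributed on each day
combined["weighted_avg"] = combined["sum_values"] / combined["sum_weights"]
print(combined["weighted_avg"])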
Hope it helps!

Related

ValueError: cannot reindex from a duplicate axis while shift one column in Pandas

Given a dataframe df with date index as follows:
value
2017-03-31 NaN
2017-04-01 27863.7
2017-04-02 27278.5
2017-04-03 27278.5
2017-04-04 27278.5
...
2021-10-27 NaN
2021-10-28 NaN
2021-10-29 NaN
2021-10-30 NaN
2021-10-31 NaN
I'm able to shift the value column by one year using df['value'].shift(freq=pd.DateOffset(years=1)):
Out:
2018-03-31 NaN
2018-04-01 27863.7
2018-04-02 27278.5
2018-04-03 27278.5
2018-04-04 27278.5
...
2022-10-27 NaN
2022-10-28 NaN
2022-10-29 NaN
2022-10-30 NaN
2022-10-31 NaN
But when I use it to replace the original value with df['value'] = df['value'].shift(freq=pd.DateOffset(years=1)), it raises an error:
ValueError: cannot reindex from a duplicate axis
Since the code below works smoothly, I think the issue is caused by the NaNs in the value column:
import pandas as pd
import numpy as np
np.random.seed(2021)
dates = pd.date_range('20130101', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
df
df.B = df.B.shift(freq=pd.DateOffset(years=1))
I also tried df['value'].shift(freq=relativedelta(years=+1)), but it generates: pandas.errors.NullFrequencyError: Cannot shift with no freq
Could someone help me deal with this issue? Sincere thanks.
Since the code below works smoothly, I think the issue is caused by the NaNs in the value column
No, I don't think so. It's more likely because your second sample doesn't contain a leap day (Feb 29) that gets shifted into a non-leap year.
Reproducible error when the range does include such a leap day:
# 2018 (365 days), 2019 (365 days) and 2020 (366 days, leap year)
dates = pd.date_range('20180101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
...
ValueError: cannot reindex from a duplicate axis
...
The example below works:
# 2017 (365 days), 2018 (365 days) and 2019 (365 days): no leap day in the range
dates = pd.date_range('20170101', periods=365*3+1)
df = pd.DataFrame(np.random.randint(0, 100, size=(365*3+1, 3)),
index=dates, columns=list('ABC'))
df.B = df.B.shift(freq=pd.DateOffset(years=1))
Just look at value_counts:
# 2018 -> 2020
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2021-02-28 2 # The duplicated index
2020-12-29 1
2021-01-04 1
2021-01-03 1
2021-01-02 1
..
2020-01-07 1
2020-01-08 1
2020-01-09 1
2020-01-10 1
2021-12-31 1
Length: 1095, dtype: int64
# 2017 -> 2019
>>> df.B.shift(freq=pd.DateOffset(years=1)).index.value_counts()
2018-01-01 1
2019-12-30 1
2020-01-05 1
2020-01-04 1
2020-01-03 1
..
2019-01-07 1
2019-01-08 1
2019-01-09 1
2019-01-10 1
2021-01-01 1
Length: 1096, dtype: int64
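As a side note (my addition, not part of the original answer), Index.duplicated surfaces just the offending label directly; using the failing 2018 example:
shifted = df.B.shift(freq=pd.DateOffset(years=1))
print(shifted.index[shifted.index.duplicated()])  # -> the duplicated label, here 2021-02-28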
Solution
Obviously, the solution is to remove the duplicated index, in our case '2021-02-28', by using resample('D') with an aggregation function such as first, last, min, max, mean, sum, or a custom one:
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28']
2021-02-28 41
2021-02-28 96
Name: B, dtype: int64
>>> df.B.shift(freq=pd.DateOffset(years=1))['2021-02-28'] \
.resample('D').agg(('first', 'last', 'min', 'max', 'mean', 'sum')).T
2021-02-28
first 41.0
last 96.0
min 41.0
max 96.0
mean 68.5
sum 137.0
# Choose `last` for example
df.B = df.B.shift(freq=pd.DateOffset(years=1)).resample('D').last()
Note: you can replace .resample(...).func with .loc[lambda x: ~x.index.duplicated()]
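A short sketch of that alternative (keeping the first of the duplicated labels; keep='last' would mirror the .last() aggregation used above):
shifted = df.B.shift(freq=pd.DateOffset(years=1))
df.B = shifted.loc[lambda x: ~x.index.duplicated(keep='first')]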

Merge converting timestamp to scientific notation and losing precision

Original timestamp dtype int64
ts = datetime.fromtimestamp(1627741304932/1000)
print(ts)
2021-07-31 17:21:44.932000
After merging the dataframes, the timestamp loses/gains +-5 minutes and the dtype turns to float64
ts = datetime.fromtimestamp(1.627741e+12/1000)
print(ts)
2021-07-31 17:16:40
Is there a way to avoid this kind of conversion, or at least the precision loss, apart from dropping the trillion+ part and adding it back after merging?
UPDATE
I've created an exact example of my issue:
Example
import pandas as pd

df1 = pd.DataFrame({'col1': ['ts1', 'ts2', 'ts3', 'ts4'],
'col2': [1627741304932, 1627741304931, 1627741304930, 1627741304929]})
df2 = pd.DataFrame({'col1': ['ts1', 'ts2', 'ts3', 'ts5'],
'col2': [1627741305932, 1627741304931, 1627741304930, 1627741304920]})
x = df1.merge(df2, on='col1', how='outer', suffixes=('_prev', '_new'))
print(x)
print(x.dtypes)
Output
It's happening because of the NaN values that are added to the dataframe during the merge:
col1 col2_prev col2_new
0 ts1 1.627741e+12 1.627741e+12
1 ts2 1.627741e+12 1.627741e+12
2 ts3 1.627741e+12 1.627741e+12
3 ts4 1.627741e+12 NaN
4 ts5 NaN 1.627741e+12
col1 object
col2_prev float64
col2_new float64
dtype: object
How can I get around this?
So it seems that the problem boils down to pandas converting the timestamps from int to float. This is because the 'int64' data type does not support NaN values.
To overcome this, we can use Nullable integer data types:
e.g.:
df1 = pd.DataFrame({'col1': ['ts1', 'ts2', 'ts3', 'ts4'],
'col2': [1627741304932, 1627741304931, 1627741304930, 1627741304929]})
df2 = pd.DataFrame({'col1': ['ts1', 'ts2', 'ts3', 'ts5'],
'col2': [1627741305932, 1627741304931, 1627741304930, 1627741304920]})
# allow NaN values (notice the capital I)
df1['col2'] = df1['col2'].astype('Int64')
df2['col2'] = df2['col2'].astype('Int64')
x = df1.merge(df2, on='col1', how='outer', suffixes=('_prev', '_new'))
print(x)
print(x.dtypes)
Output:
col1 col2_prev col2_new
0 ts1 1627741304932 1627741305932
1 ts2 1627741304931 1627741304931
2 ts3 1627741304930 1627741304930
3 ts4 1627741304929 <NA>
4 ts5 <NA> 1627741304920
col1 object
col2_prev Int64
col2_new Int64
dtype: object
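An alternative sketch (my suggestion, not part of the answer above): convert the epoch-millisecond column to datetime64[ns] before merging. Missing rows then become NaT instead of forcing a float conversion, and no precision is lost:
import pandas as pd

df1 = pd.DataFrame({'col1': ['ts1', 'ts2', 'ts3', 'ts4'],
                    'col2': [1627741304932, 1627741304931, 1627741304930, 1627741304929]})
df2 = pd.DataFrame({'col1': ['ts1', 'ts2', 'ts3', 'ts5'],
                    'col2': [1627741305932, 1627741304931, 1627741304930, 1627741304920]})

# Convert the millisecond epochs to datetime64[ns]; missing rows become NaT after the merge
df1['col2'] = pd.to_datetime(df1['col2'], unit='ms')
df2['col2'] = pd.to_datetime(df2['col2'], unit='ms')

x = df1.merge(df2, on='col1', how='outer', suffixes=('_prev', '_new'))
print(x.dtypes)  # both col2 columns stay datetime64[ns]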

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply values from this lookup dataframe with a column of another dataframe, which has a datetime index, to create a new column.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to the data-2 value depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
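A third option (my suggestion, not part of either answer above) is Index.map, which avoids both the Python loop and the temporary hour column; it assumes df_lookup has a single column named multiplier as in the join answer:
# Look up each row's hour in the lookup column, fully vectorized
df['data-3'] = df['data-2'] * df.index.hour.map(df_lookup['multiplier']).to_numpy()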

Resampling with a Multiindex and several columns

I have a pandas dataframe with the following structure:
ID date m_1 m_2
1 2016-01-03 10 3.4
2016-02-07 11 3.3
2016-02-07 10.4 2.8
2 2016-01-01 10.9 2.5
2016-02-04 12 2.3
2016-02-04 11 2.7
2016-02-04 12.1 2.1
Both ID and date are a MultiIndex. The data represent some measurements made by some sensors (in the example two sensors). Those sensors sometimes create several measurements per day (as shown in the example).
My questions are:
How can I resample this so I have one row per day per sensor, but one column with the mean, another with the max another with min, etc?
How can I "align" (maybe this is no the correct word) the two time series, so both begin and end at the same time (from 2016-01-01 to 2016-02-07) adding the missing days with NAs?
You can use groupby with DataFrameGroupBy.resample, aggregate with a dict of functions first, and then reindex by MultiIndex.from_product:
df = df.reset_index(level=0).groupby('ID').resample('D').agg({'m_1':'mean', 'm_2':'max'})
df = df.reindex(pd.MultiIndex.from_product(df.index.levels, names = df.index.names))
#alternative for adding missing start and end datetimes
#df = df.unstack().stack(dropna=False)
print (df.head())
m_2 m_1
ID date
1 2016-01-01 NaN NaN
2016-01-02 NaN NaN
2016-01-03 3.4 10.0
2016-01-04 NaN NaN
2016-01-05 NaN NaN
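If you want the mean, max and min of each measurement column, as the first part of the question asks, you can pass a dict of function lists to agg. A sketch along the same lines, starting again from the original df (the name stats is mine; the resulting columns form a MultiIndex such as ('m_1', 'mean')):
stats = (df.reset_index(level=0)
           .groupby('ID')
           .resample('D')
           .agg({'m_1': ['mean', 'max', 'min'], 'm_2': ['mean', 'max', 'min']}))
# Flatten the column MultiIndex to m_1_mean, m_1_max, ... if preferred
stats.columns = ['_'.join(col) for col in stats.columns]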
For a PeriodIndex in the second level, use set_levels with to_period:
df.index = df.index.set_levels(df.index.get_level_values('date').to_period('d'), level=1)
print (df.index.get_level_values('date'))
PeriodIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
'2016-01-05', '2016-01-06', '2016-01-07', '2016-01-08',
'2016-01-09', '2016-01-10', '2016-01-11', '2016-01-12',
'2016-01-13', '2016-01-14', '2016-01-15', '2016-01-16',
'2016-01-17', '2016-01-18', '2016-01-19', '2016-01-20',
'2016-01-21', '2016-01-22', '2016-01-23', '2016-01-24',
'2016-01-25', '2016-01-26', '2016-01-27', '2016-01-28',
'2016-01-29', '2016-01-30', '2016-01-31', '2016-02-01',
'2016-02-02', '2016-02-03', '2016-02-04', '2016-02-05',
'2016-02-06', '2016-02-07', '2016-01-01', '2016-01-02',
'2016-01-03', '2016-01-04', '2016-01-05', '2016-01-06',
'2016-01-07', '2016-01-08', '2016-01-09', '2016-01-10',
'2016-01-11', '2016-01-12', '2016-01-13', '2016-01-14',
'2016-01-15', '2016-01-16', '2016-01-17', '2016-01-18',
'2016-01-19', '2016-01-20', '2016-01-21', '2016-01-22',
'2016-01-23', '2016-01-24', '2016-01-25', '2016-01-26',
'2016-01-27', '2016-01-28', '2016-01-29', '2016-01-30',
'2016-01-31', '2016-02-01', '2016-02-02', '2016-02-03',
'2016-02-04', '2016-02-05', '2016-02-06', '2016-02-07'],
dtype='period[D]', name='date', freq='D')

Sorting a dataframe by index

I have a dataframe called df that is indexed by date, which I am trying to sort from oldest date to newest.
I have tried to use both:
df = df.sort(axis=1)
and:
df = df.sort_index(axis=1)
but as you can see from the following df tail, the dates have not been sorted into date order.
wood_density
date
2016-01-27 5.8821
2016-01-28 5.7760
2015-12-25 NaN
2016-01-01 NaN
2015-12-26 NaN
What can I try to resolve this?
Use sort_index to sort the index:
In [19]:
df = df.sort_index()
df
Out[19]:
wood_density
date
2015-12-25 NaN
2015-12-26 NaN
2016-01-01 NaN
2016-01-27 5.8821
2016-01-28 5.7760
sort, which has since been deprecated in favour of sort_values and sort_index, sorts on row labels by default (axis=0), so it would have worked if you hadn't passed axis=1:
In [21]:
df.sort()
Out[21]:
wood_density
date
2015-12-25 NaN
2015-12-26 NaN
2016-01-01 NaN
2016-01-27 5.8821
2016-01-28 5.7760
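Note that in current pandas versions DataFrame.sort has been removed entirely, so sort_index (or sort_values for ordinary columns) is the supported way to order the frame; a minimal sketch:
df = df.sort_index()                   # sort by the date index, oldest first
# df = df.sort_index(ascending=False)  # newest first instead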
