Concatenate MultiIndex Dataframes in Python - python-3.x

I have two MultiIndex DataFrames (date and ticker are the index levels).
df_usdtbtc:
close
date ticker
2017-12-31 USDT_BTC 13769
2018-01-01 USDT_BTC 13351
and df_usdteth:
close
date ticker
2017-12-31 USDT_ETH 736
2018-01-01 USDT_ETH 754
Is there any way to merge, concat, or join these two DataFrames to get this result?
close
date ticker
2017-12-31 USDT_BTC 13769
USDT_ETH 736
2018-01-01 USDT_BTC 13351
USDT_ETH 754
To help set up the DataFrames:
import pandas as pd
from datetime import datetime as dtm

df_usdtbtc = {'dates': [dtm(2018, 1, 1), dtm(2018, 1, 2)], 'ticker': ['USDT_BTC', 'USDT_BTC'], 'close': [13769, 13351]}
df_usdteth = {'dates': [dtm(2018, 1, 1), dtm(2018, 1, 2)], 'ticker': ['USDT_ETH', 'USDT_ETH'], 'close': [736, 754]}
df_usdtbtc = pd.DataFrame(data=df_usdtbtc)
df_usdtbtc = df_usdtbtc.set_index(['dates', 'ticker'])
df_usdteth = pd.DataFrame(data=df_usdteth)
df_usdteth = df_usdteth.set_index(['dates', 'ticker'])

Use concat or DataFrame.append with sort_index (note that DataFrame.append is deprecated in newer pandas versions, so prefer concat):
df = pd.concat([df_usdtbtc, df_usdteth]).sort_index()
Or:
df = df_usdtbtc.append(df_usdteth).sort_index()
df = pd.concat([df_usdtbtc, df_usdteth]).sort_index()
print(df)
close
date ticker
2017-12-31 USDT_BTC 13769
USDT_ETH 736
2018-01-01 USDT_BTC 13351
USDT_ETH 754

Related

Calculating weighted sum over different time series using pd.concat

I have multiple time series with different data density and different lengths, and I want to calculate a weighted sum. For example, df1, df2 and df3:
Out[467]:
datetime_doy
2017-01-01 0.308632
2017-01-02 0.307647
2017-01-03 0.306493
2017-01-04 0.292955
2017-01-10 0.369009
2019-12-27 0.387553
2019-12-28 0.383481
2019-12-29 0.382838
2019-12-30 0.379383
2019-12-31 0.379172
Name: df1, Length: 1055, dtype: float64
datetime_doy
2017-01-01 0.310446
2017-01-02 0.309330
2017-01-03 0.308632
2017-01-04 0.306234
2017-01-10 0.317367
2019-12-27 0.387510
2019-12-28 0.383549
2019-12-29 0.382762
2019-12-30 0.379483
2019-12-31 0.379078
Name: df2, Length: 1042, dtype: float64
datetime_doy
2017-01-01 0.302718
2017-01-02 0.301939
2017-01-03 0.301440
2017-01-04 0.300281
2017-01-05 0.299731
2017-08-27 0.227604
2017-08-28 0.227431
2017-08-30 0.227167
2017-08-31 0.237400
2017-09-01 0.243424
Name: df3, Length: 227, dtype: float64
I know that if I want to calculate the mean, I can just use pd.concat([df1, df2, df3], axis=1).mean(axis=1), like:
pd.concat([df1, df2, df3],axis=1).mean(axis=1)
Out[475]:
datetime_doy
2017-01-01 0.307265
2017-01-02 0.306305
2017-01-03 0.305522
2017-01-04 0.299823
2017-01-05 0.299731
2019-12-27 0.387532
2019-12-28 0.383515
2019-12-29 0.382800
2019-12-30 0.379433
2019-12-31 0.379125
Length: 1065, dtype: float64
But what if I want to calculate the weighted average of df1, df2 and df3, say with weights 0.1, 0.2 and 0.3? At time t, if only df1 and df2 have values, the new value is (0.1*df1.iloc[t] + 0.2*df2.iloc[t]) / (0.1+0.2). If df1, df2 and df3 all have values at time t, it is (0.1*df1.iloc[t] + 0.2*df2.iloc[t] + 0.3*df3.iloc[t]) / (0.1+0.2+0.3). If none of the dataframes has a value at time t, the result is simply np.nan (note that df3 only has data in 2017).
So how can I get this? Thanks!
I have found a solution to your problem by creating a separate pd.DataFrame for the weights. This way, you can keep the sum of values for each day and the sum of weights for each day separate. I have created an example to illustrate the point:
import numpy as np
import pandas as pd

a = ["2022-12-01", "2022-12-02", "2022-12-03", "2022-12-04", "2022-12-05"]
b = ["2022-12-03", "2022-12-04", "2022-12-05", "2022-12-06", "2022-12-07"]
c = ["2022-12-05", "2022-12-06", "2022-12-07", "2022-12-08", "2022-12-09"]
WEIGHT1 = 0.1
WEIGHT2 = 0.2
WEIGHT3 = 0.3
df1 = pd.DataFrame(data=np.random.normal(size=5), index=a, columns=["a"])
df2 = pd.DataFrame(data=np.random.normal(size=5), index=b, columns=["b"])
df3 = pd.DataFrame(data=np.random.normal(size=5), index=c, columns=["c"])
I have defined the above dates for my dataframes and weights following your example. As you pointed out in your question, there are dates that belong to all three dataframes, dates that belong to only two, and dates that are unique to one df. I have also filled the frames with random values.
df1_weight = pd.DataFrame(data = WEIGHT1, index=df1.index, columns=["weight1"])
df2_weight = pd.DataFrame(data = WEIGHT2, index=df2.index, columns=["weight2"])
df3_weight = pd.DataFrame(data = WEIGHT3, index=df3.index, columns=["weight3"])
pd.concat([df1*WEIGHT1, df2*WEIGHT2, df3*WEIGHT3], axis=1).sum(axis=1).rename("sum_values").to_frame().join(
    pd.concat([df1_weight, df2_weight, df3_weight], axis=1).sum(axis=1).rename("sum_weights"))
My proposed solution consists of creating three dataframes, one for each weight, and concatenating them as you did in the question. With the last expression I concat all the values and all the weights and sum them for each day; this way you only need to divide the two columns to obtain the desired values.
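As a minimal sketch of that last division step, reusing the frames defined above (the variable names sum_values, sum_weights and weighted_avg are introduced here just for illustration):
# Sum of weighted values and sum of applicable weights per day
sum_values = pd.concat([df1*WEIGHT1, df2*WEIGHT2, df3*WEIGHT3], axis=1).sum(axis=1)
sum_weights = pd.concat([df1_weight, df2_weight, df3_weight], axis=1).sum(axis=1)
# Weighted average: only the weights of series present on a given day contribute
weighted_avg = sum_values / sum_weights
print(weighted_avg)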
Hope it helps!

how to get employee count by Hour and Date using pySpark / python?

I have employee IDs with their clock-in and clock-out timings by day. I want to calculate the number of employees present in the office by hour and by date.
Example Data
import pandas as pd
data1 = {'emp_id': ['Employee 1', 'Employee 2', 'Employee 3', 'Employee 4', 'Employee 5'],
         'Clockin': ['12/5/2021 0:08', '8/7/2021 0:04', '3/30/2021 1:24', '12/23/2021 22:45', '12/23/2021 23:29'],
         'Clockout': ['12/5/2021 3:28', '8/7/2021 0:34', '3/30/2021 4:37', '12/24/2021 0:42', '12/24/2021 1:42']}
df1 = pd.DataFrame(data1)
Example of output
import pandas as pd
data2 = {'Date': ['12/5/2021', '8/7/2021', '3/30/2021', '3/30/2021', '3/30/2021', '3/30/2021', '12/23/2021', '12/23/2021', '12/24/2021', '12/24/2021'],
         'Hour': ['01:00', '01:00', '02:00', '03:00', '04:00', '05:00', '22:00', '23:00', '01:00', '02:00'],
         'emp_count': [1, 1, 1, 1, 1, 1, 1, 2, 2, 1]}
df2 = pd.DataFrame(data2)
Try this:
# Round clock in DOWN to the nearest PRECEDING hour
clock_in = pd.to_datetime(df1["Clockin"]).dt.floor("H")
# Round clock out UP to the nearest SUCCEEDING hour
clock_out = pd.to_datetime(df1["Clockout"]).dt.ceil("H")
# Generate time series at hourly frequency between adjusted clock in and clock
# out time
hours = pd.Series(
    [
        pd.date_range(in_, out_, freq="H", inclusive="right")
        for in_, out_ in zip(clock_in, clock_out)
    ]
).explode()
# Final result
hours.groupby(hours).count()
Result:
2021-03-30 02:00:00 1
2021-03-30 03:00:00 1
2021-03-30 04:00:00 1
2021-03-30 05:00:00 1
2021-08-07 01:00:00 1
2021-12-05 01:00:00 1
2021-12-05 02:00:00 1
2021-12-05 03:00:00 1
2021-12-05 04:00:00 1
2021-12-23 23:00:00 1
2021-12-24 00:00:00 2
2021-12-24 01:00:00 2
2021-12-24 02:00:00 1
dtype: int64
It's slightly different from your expected output but consistent with your business rules.
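If you need the result shaped like the expected df2 above (separate Date and Hour columns plus emp_count), here is a minimal follow-up sketch building on the hours series from the answer; the column names and the date/hour string formats are assumptions:
counts = hours.groupby(hours).count()
out = counts.rename("emp_count").rename_axis("timestamp").reset_index()
out["timestamp"] = pd.to_datetime(out["timestamp"])  # make sure the column is datetime before using .dt
out["Date"] = out["timestamp"].dt.strftime("%m/%d/%Y")  # assumed date format
out["Hour"] = out["timestamp"].dt.strftime("%H:%M")     # assumed hour format
print(out[["Date", "Hour", "emp_count"]])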

Adding Minutes to Pandas DatetimeIndex

I have an index that contains dates.
DatetimeIndex(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07',
               '2004-01-08', '2004-01-09', '2004-01-12', '2004-01-13',
               '2004-01-14', '2004-01-15',
               ...
               '2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
               '2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
               '2015-12-30', '2015-12-31'],
              dtype='datetime64[ns]', length=3021, freq=None)
Now I would like to generate every minute (24*60 = 1440 minutes) within each of those days and build an index with all days and minutes.
The result should look like:
['2004-01-02 00:00:00', '2004-01-02 00:01:00', ..., '2004-01-02 23:59:00',
'2004-01-03 00:00:00', '2004-01-03 00:01:00', ..., '2004-01-03 23:59:00',
...
'2015-12-31 00:00:00', '2015-12-31 00:01:00', ..., '2015-12-31 23:59:00']
Is there a smart trick for this?
You should be able to use .asfreq() here:
>>> import pandas as pd
>>> days = pd.date_range(start='2018-01-01', periods=10)
>>> df = pd.DataFrame(list(range(len(days))), index=days)
>>> df.asfreq('min')
0
2018-01-01 00:00:00 0.0
2018-01-01 00:01:00 NaN
2018-01-01 00:02:00 NaN
2018-01-01 00:03:00 NaN
2018-01-01 00:04:00 NaN
2018-01-01 00:05:00 NaN
2018-01-01 00:06:00 NaN
# ...
>>> df.shape
(10, 1)
>>> df.asfreq('min').shape
(12961, 1)
If that doesn't work for some reason, you might also want to look into pd.MultiIndex.from_product() and then pd.to_datetime() on the combined result; a rough sketch of that idea follows.
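A minimal sketch of that alternative, assuming days is the existing daily DatetimeIndex (replaced below by a tiny hypothetical stand-in); unlike asfreq, it expands only the days actually present and skips calendar gaps:
import pandas as pd

days = pd.DatetimeIndex(['2004-01-02', '2004-01-05'])  # hypothetical stand-in for the real index

# Pair every day with every minute-of-day offset, then add them together
offsets = pd.timedelta_range('0min', periods=24 * 60, freq='min')
pairs = pd.MultiIndex.from_product([days, offsets])
minutes = pd.DatetimeIndex([day + offset for day, offset in pairs])

print(minutes[:3])
print(len(minutes))  # 2 days * 1440 minutes = 2880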

How to get all indexes which had a particular value in last row of a Pandas DataFrame?

For a sample DataFrame like,
>>> import pandas as pd
>>> index = pd.date_range(start='1/1/2018', periods=6, freq='15T')
>>> data = ['ON_PEAK', 'OFF_PEAK', 'ON_PEAK', 'ON_PEAK', 'OFF_PEAK', 'OFF_PEAK']
>>> df = pd.DataFrame(data, index=index, columns=['tou'])
>>> df
tou
2018-01-01 00:00:00 ON_PEAK
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 00:30:00 ON_PEAK
2018-01-01 00:45:00 ON_PEAK
2018-01-01 01:00:00 OFF_PEAK
2018-01-01 01:15:00 OFF_PEAK
How do I get all indexes for which the tou value is not ON_PEAK but the row before them is ON_PEAK, i.e. the output would be:
['2018-01-01 00:15:00', '2018-01-01 01:00:00']
Or, if it's easier, get all rows with ON_PEAK plus the first row after them, i.e.
['2018-01-01 00:00:00', '2018-01-01 00:15:00', '2018-01-01 00:30:00', '2018-01-01 00:45:00', '2018-01-01 01:00:00']
You need to find rows where tou is not ON_PEAK and the previous tou, found using pandas.shift(), is ON_PEAK. Note that a positive value in shift gives the nth previous value and a negative value gives the nth next value in the dataframe.
df.loc[(df['tou']!='ON_PEAK') & (df['tou'].shift(1)=='ON_PEAK')]
Output:
tou
2018-01-01 00:15:00 OFF_PEAK
2018-01-01 01:00:00 OFF_PEAK
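For the second output mentioned in the question (all ON_PEAK rows plus the first row after them), a sketch in the same spirit, using the df from the question's setup; this variant is not part of the original answer:
# Rows that are ON_PEAK themselves, or whose previous row was ON_PEAK
mask = (df['tou'] == 'ON_PEAK') | (df['tou'].shift(1) == 'ON_PEAK')
print(df.loc[mask].index.tolist())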

roll off profile stacking data frames

I have a dataframe that looks like:
import pandas as pd
import datetime as dt
df= pd.DataFrame({'date':['2017-12-31','2017-12-31'],'type':['Asset','Liab'],'Amount':[100,-100],'Maturity Date':['2019-01-02','2018-01-01']})
df
I am trying to build a roll-off profile by checking whether the 'Maturity Date' is greater than a 'date' in the future. I want to achieve something like:
#First Month
df1=df[df['Maturity Date']>'2018-01-31']
df1['date']='2018-01-31'
#Second Month
df2=df[df['Maturity Date']>'2018-02-28']
df2['date']='2018-02-28'
#third Month
df3=df[df['Maturity Date']>'2018-03-31']
df3['date']='2018-03-31'
#first quarter
qf1=df[df['Maturity Date']>'2018-06-30']
qf1['date']='2018-06-30'
#concatenate
df=pd.concat([df,df1,df2,df3,qf1])
df
I was wondering if there is a way to allow an arbitrarily long list of dates without repeating code.
I think you need numpy.tile to repeat the index and assign the datetimes to a new column, and finally filter by boolean indexing (optionally sorting with sort_values, as in the follow-up after the output below):
import numpy as np

d = '2017-12-31'
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])
#generate first month and next quarters
c1 = pd.date_range(d, periods=4, freq='M')
c2 = pd.date_range(c1[-1], periods=2, freq='Q')
#join together
c = c1.union(c2[1:])
#repeat rows be indexing repeated index
df1 = df.loc[np.tile(df.index, len(c))].copy()
#assign column by datetimes
df1['date'] = np.repeat(c, len(df))
#filter by boolean indexing
df1 = df1[df1['Maturity Date'] > df1['date']]
print (df1)
Amount Maturity Date date type
0 100 2019-01-02 2017-12-31 Asset
1 -100 2018-01-01 2017-12-31 Liab
0 100 2019-01-02 2018-01-31 Asset
0 100 2019-01-02 2018-02-28 Asset
0 100 2019-01-02 2018-03-31 Asset
0 100 2019-01-02 2018-06-30 Asset
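The intro above mentions sort_values, but the snippet stops at the boolean filter; an optional follow-up sketch (the sort keys are an assumption):
# Optional: order the stacked roll-off profile by the profile date, then by Maturity Date
df1 = df1.sort_values(['date', 'Maturity Date'])
print(df1)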
You could use a nifty tool in the Pandas arsenal called pd.merge_asof. It works similarly to pd.merge, except that it matches on "nearest" keys rather than equal keys. Furthermore, you can tell pd.merge_asof to look for nearest keys in only the backward or forward direction.
To make things interesting (and help check that things are working properly), let's add another row to df:
df = pd.DataFrame({'date': ['2017-12-31', '2017-12-31'], 'type': ['Asset', 'Asset'], 'Amount': [100, 200], 'Maturity Date': ['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
print(df)
# Amount Maturity Date date type
# 1 200 2018-03-15 2017-12-31 Asset
# 0 100 2019-01-02 2017-12-31 Asset
Now define some new dates:
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
         .union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
# date
# 0 2018-01-31
# 1 2018-02-28
# 2 2018-03-31
# 3 2018-06-30
Now we can merge rows, matching nearest dates from result with Maturity Dates from df:
result = pd.merge_asof(result, df.drop('date', axis=1),
                       left_on='date', right_on='Maturity Date', direction='forward')
In this case we want to "match" dates with Maturity Dates which are greater, so we use direction='forward'.
Putting it all together:
import pandas as pd
df = pd.DataFrame({'date': ['2017-12-31', '2017-12-31'], 'type': ['Asset', 'Asset'], 'Amount': [100, 200], 'Maturity Date': ['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
    df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
         .union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
result = pd.merge_asof(result, df.drop('date', axis=1),
                       left_on='date', right_on='Maturity Date', direction='forward')
result = pd.concat([df, result], axis=0)
result = result.sort_values(by=['Maturity Date', 'date'])
print(result)
yields
Amount Maturity Date date type
1 200 2018-03-15 2017-12-31 Asset
0 200 2018-03-15 2018-01-31 Asset
1 200 2018-03-15 2018-02-28 Asset
0 100 2019-01-02 2017-12-31 Asset
2 100 2019-01-02 2018-03-31 Asset
3 100 2019-01-02 2018-06-30 Asset
