How to get all indexes which had a particular value in the previous row of a Pandas DataFrame?

For a sample DataFrame like,
>>> import pandas as pd
>>> index = pd.date_range(start='1/1/2018', periods=6, freq='15T')
>>> data = ['ON_PEAK', 'OFF_PEAK', 'ON_PEAK', 'ON_PEAK', 'OFF_PEAK', 'OFF_PEAK']
>>> df = pd.DataFrame(data, index=index, columns=['tou'])
>>> df
                          tou
2018-01-01 00:00:00   ON_PEAK
2018-01-01 00:15:00  OFF_PEAK
2018-01-01 00:30:00   ON_PEAK
2018-01-01 00:45:00   ON_PEAK
2018-01-01 01:00:00  OFF_PEAK
2018-01-01 01:15:00  OFF_PEAK
How do I get all indexes for which the tou value is not ON_PEAK but the row before them is ON_PEAK, i.e. the output would be:
['2018-01-01 00:15:00', '2018-01-01 01:00:00']
Or, if it's easier, get all rows with ON_PEAK plus the first row after each of them, i.e.:
['2018-01-01 00:00:00', '2018-01-01 00:15:00', '2018-01-01 00:30:00', '2018-01-01 00:45:00', '2018-01-01 01:00:00']

You need to find rows where tou is not ON_PEAK and the previous tou, found using Series.shift(), is ON_PEAK. Note that a positive argument to shift gives the nth previous value and a negative argument gives the nth next value in the DataFrame.
df.loc[(df['tou']!='ON_PEAK') & (df['tou'].shift(1)=='ON_PEAK')]
Output:
                          tou
2018-01-01 00:15:00  OFF_PEAK
2018-01-01 01:00:00  OFF_PEAK
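For the alternative output (every ON_PEAK row plus the first row after each ON_PEAK run), a sketch built on the same shift idea:
# rows that are ON_PEAK, or whose previous row was ON_PEAK
mask = (df['tou'] == 'ON_PEAK') | (df['tou'].shift(1) == 'ON_PEAK')
df.loc[mask].index.tolist()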

Related

Copy a single row value and apply it as a column using Pandas

My dataset looks like this:
# df1 - minute based dataset
date Open
2018-01-01 00:00:00 1.0536
2018-01-01 00:01:00 1.0527
2018-01-01 00:02:00 1.0558
2018-01-01 00:03:00 1.0534
2018-01-01 00:04:00 1.0524
...
What I want to do is get the value at 05:00:00 daily, create a new column called OpenVal_5AM, and put the corresponding value in that column. The new df will look like this:
# df2 - minute based dataset with 05:00:00 Open value
date Open OpenVal_5AM
2018-01-01 00:00:00 1.0536 1.0133
2018-01-01 00:01:00 1.0527 1.0133
2018-01-01 00:02:00 1.0558 1.0133
2018-01-01 00:03:00 1.0534 1.0133
2018-01-01 00:04:00 1.0524 1.0133
...
Since this is minute-based data, we will have 1440 identical data points in the new column OpenVal_5AM for each day, because we are just grabbing the value at one point in time per day and creating a new column.
What did I do?
I used this step:
df['OpenVal_5AM'] = df.groupby(df.date.dt.date,sort=False).Open.dt.hour.between(5, 5)
That's the closest I could come, but it does not work.
Here's my suggestion:
df['OpenVal_5AM'] = df.apply(lambda r: df.Open.loc[r.name.replace(hour=5, minute=0)], axis=1)  # r.name is the row's timestamp, pinned to that day's 05:00:00
Disclaimer: I didn't test it with a huge dataset; so I don't know how it'll perform in that situation.
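A vectorized sketch that avoids apply, assuming df has a DatetimeIndex and exactly one 05:00:00 row per day:
# select each day's 05:00:00 Open value
five_am = df.loc[(df.index.hour == 5) & (df.index.minute == 0), 'Open']
# re-key it by the day's midnight so every row of that day can look it up
five_am.index = five_am.index.normalize()
df['OpenVal_5AM'] = df.index.normalize().map(five_am)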

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (the hour is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to multiply the values from this lookup dataframe to create a column of another dataframe, which has datetime as index.
The dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to data-2 depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc:
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
#df['data-3']=df['data-2']*df_lookup.loc[df.index.hour].values
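One caveat: if df_lookup is a single-column DataFrame rather than a Series, its .values is two-dimensional and the multiplication broadcasts incorrectly; a sketch of the guard:
# DataFrame with one column -> Series, so .values below is 1-D
lookup = df_lookup.squeeze()
df['data-3'] = df['data-2'] * lookup.loc[df.index.hour].values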
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)

Adding Minutes to Pandas DatetimeIndex

I have an index that contains dates.
DatetimeIndex(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07',
'2004-01-08', '2004-01-09', '2004-01-12', '2004-01-13',
'2004-01-14', '2004-01-15',
...
'2015-12-17', '2015-12-18', '2015-12-21', '2015-12-22',
'2015-12-23', '2015-12-24', '2015-12-28', '2015-12-29',
'2015-12-30', '2015-12-31'],
dtype='datetime64[ns]', length=3021, freq=None)
Now I would like to generate every minute (24*60 = 1440 minutes) within each day and make an index with all days and minutes.
The result should look like:
['2004-01-02 00:00:00', '2004-01-02 00:01:00', ..., '2004-01-02 23:59:00',
'2004-01-03 00:00:00', '2004-01-03 00:01:00', ..., '2004-01-03 23:59:00',
...
'2015-12-31 00:00:00', '2015-12-31 00:01:00', ..., '2015-12-31 23:59:00']
Is there a smart trick for this?
You should be able to use .asfreq() here:
>>> import pandas as pd
>>> days = pd.date_range(start='2018-01-01', periods=10, freq='D')
>>> df = pd.DataFrame(list(range(len(days))), index=days)
>>> df.asfreq('min')
0
2018-01-01 00:00:00 0.0
2018-01-01 00:01:00 NaN
2018-01-01 00:02:00 NaN
2018-01-01 00:03:00 NaN
2018-01-01 00:04:00 NaN
2018-01-01 00:05:00 NaN
2018-01-01 00:06:00 NaN
# ...
>>> df.shape
(10, 1)
>>> df.asfreq('min').shape
(12961, 1)
Note that asfreq only fills in between the first and last timestamps, which is why the result has 12961 rows rather than 10 × 1440. If that doesn't work for some reason, you might also want to have a look at pd.MultiIndex.from_product() and then build a flat DatetimeIndex from the result.
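A sketch of that route, assuming days is the DatetimeIndex shown above; it pairs every day with every minute offset and adds them, rather than formatting and re-parsing strings:
# all 1440 minute offsets within a day
minutes = pd.timedelta_range(start='0min', periods=1440, freq='min')
pairs = pd.MultiIndex.from_product([days, minutes])
# add each day to each offset to get a flat minute-level DatetimeIndex
full_idx = pairs.get_level_values(0) + pairs.get_level_values(1)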

Concatenate MultiIndex Dataframes in Python

I have 2 MultiIndex dataframes (date and ticker are the indexes)
df_usdtbtc:
close
date ticker
2017-12-31 USDT_BTC 13769
2018-01-01 USDT_BTC 13351
and df_ethbtc:
close
date ticker
2017-12-31 USDT_ETH 736
2018-01-01 USDT_ETH 754
Is there any way to merge, concat, or join these 2 dataframes to get this result:
close
date ticker
2017-12-31 USDT_BTC 13769
USDT_ETH 736
2018-01-01 USDT_BTC 13351
USDT_ETH 754
To help set up the dataframes:
from datetime import datetime as dtm

df_usdtbtc = {'dates': [dtm(2018, 1, 1), dtm(2018, 1, 2)], 'ticker': ['USDT_BTC', 'USDT_BTC'], 'close': [13769, 13351]}
df_usdteth = {'dates': [dtm(2018, 1, 1), dtm(2018, 1, 2)], 'ticker': ['USDT_ETH', 'USDT_ETH'], 'close': [736, 754]}
df_usdtbtc = pd.DataFrame(data=df_usdtbtc)
df_usdtbtc = df_usdtbtc.set_index(['dates', 'ticker'])
df_usdteth = pd.DataFrame(data=df_usdteth)
df_usdteth = df_usdteth.set_index(['dates', 'ticker'])
Use concat or DataFrame.append with sort_index:
df = pd.concat([df_usdtbtc, df_ethbtc]).sort_index()
Or (DataFrame.append was removed in pandas 2.0, so prefer concat):
df = df_usdtbtc.append(df_ethbtc).sort_index()
print (df)
close
date ticker
2017-12-31 USDT_BTC 13769
USDT_ETH 736
2018-01-01 USDT_BTC 13351
USDT_ETH 754

Create a list of duplicate index entries in pandas dataframe

I am trying to identify which time stamps in my index have duplicates, and to create a list of those time stamp strings. I would like to return a single timestamp for each of the time stamps that have duplicates, if possible.
#required packages
import os
import pandas as pd
import numpy as np
import datetime
# create sample time series
header = ['A','B','C','D','E']
period = 5
cols = len(header)
dates = pd.date_range('1/1/2000', periods=period, freq='10min')
dates2 = pd.date_range('1/1/2022', periods=period, freq='10min')
df = pd.DataFrame(np.random.randn(period,cols),index=dates,columns=header)
df0 = pd.DataFrame(np.random.randn(period,cols),index=dates2,columns=header)
df1 = pd.concat([df]*3) #creates duplicate entries by copying the dataframe
df1 = pd.concat([df1, df0])
df2 = df1.sample(frac=1) #shuffles the dataframe
df3 = df1.sort_index() #sorts the dataframe by index
print(df2)
#print(df3)
# Identifying duplicated entries
df4 = df2.duplicated()
print(df4)
I would then like to use the list to call out all the duplicate entries for each time stamp. From the code above, is there a good way to get the index entries that correspond to a False value?
Edit: added an extra dataframe to create some unique values and tripled the first dataframe to create more than a single repeat. Also added more detail to the question.
IIUC:
df4[~df4]
Output:
2000-01-01 00:10:00 False
2000-01-01 00:00:00 False
2000-01-01 00:40:00 False
2000-01-01 00:30:00 False
2000-01-01 00:20:00 False
dtype: bool
As a list of timestamps:
df4[~df4].index.tolist()
Output:
[Timestamp('2000-01-01 00:10:00'),
Timestamp('2000-01-01 00:00:00'),
Timestamp('2000-01-01 00:40:00'),
Timestamp('2000-01-01 00:30:00'),
Timestamp('2000-01-01 00:20:00')]
In [46]: df2.drop_duplicates()
Out[46]:
A B C D E
2000-01-01 00:00:00 0.932587 -1.508587 -0.385396 -0.692379 2.083672
2000-01-01 00:40:00 0.237324 -0.321555 -0.448842 -0.983459 0.834747
2000-01-01 00:20:00 1.624815 -0.571193 1.951832 -0.642217 1.744168
2000-01-01 00:30:00 0.079106 -1.290473 2.635966 1.390648 0.206017
2000-01-01 00:10:00 0.760976 0.643825 -1.855477 -1.172241 0.532051
In [47]: df2.drop_duplicates().index.tolist()
Out[47]:
[Timestamp('2000-01-01 00:00:00'),
Timestamp('2000-01-01 00:40:00'),
Timestamp('2000-01-01 00:20:00'),
Timestamp('2000-01-01 00:30:00'),
Timestamp('2000-01-01 00:10:00')]
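If the goal is specifically the timestamps that occur more than once in the index (one entry per duplicated stamp), a sketch using Index.duplicated:
# keep=False flags every occurrence of a repeated timestamp
dup_stamps = df2.index[df2.index.duplicated(keep=False)].unique()
dup_stamps.tolist()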
