Pandas dataframe groupby by day and find first value that exceeds value at fixed time - python-3.x

I have a datetime-indexed dataframe with several years of intraday data in 2-minute increments. I want to group by day and, for each day, include the first row whose price exceeds the price at 06:30:00 that day.
df:
Price
2009-10-12 06:30:00 904
2009-10-12 06:32:00 904
2009-10-12 06:34:00 904.5
2009-10-12 06:36:00 905
2009-10-12 06:38:00 905.5
2009-10-13 06:30:00 901
2009-10-13 06:32:00 901
2009-10-13 06:34:00 901
2009-10-13 06:36:00 902
2009-10-13 06:38:00 903
I've tried using .groupby and .apply with a lambda function to group by day and include all rows that exceed the value at 06:30:00, but I get an error:
onh = pd.to_datetime('6:30:00').time()
onhBreak = df.groupby(df.index.date).apply(lambda x: x[x > x.loc[onh]])
ValueError: Can only compare identically-labeled Series objects
Desired output:
Price
2009-10-12 06:34:00 904.5
2009-10-13 06:36:00 902
(If these rows came back as the values of a groupby, that would also work.)
Any help is appreciated.

Here we need groupby with idxmax. Compare each price with its day's 06:30 price (rather than with the time), flag the exceeding rows, then take the first flagged row per day:
onh = pd.to_datetime('06:30:00').time()
# each day's 06:30 price, broadcast over that day's rows
base = df['Price'].where(df.index.time == onh).groupby(df.index.date).transform('first')
df['check'] = df['Price'] > base
# idxmax returns the first True per day
subdf = df.loc[df.groupby(df.index.date)['check'].idxmax()]
# drop days that never exceed their 06:30 price
subdf = subdf[subdf['check']]
print(subdf)
                     Price  check
2009-10-12 06:34:00  904.5   True
2009-10-13 06:36:00  902.0   True
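For reference, a minimal sketch that rebuilds the question's sample frame (values transcribed from the post), so the snippet above can be run as-is:
import pandas as pd
idx = pd.to_datetime(['2009-10-12 06:30', '2009-10-12 06:32', '2009-10-12 06:34',
                      '2009-10-12 06:36', '2009-10-12 06:38',
                      '2009-10-13 06:30', '2009-10-13 06:32', '2009-10-13 06:34',
                      '2009-10-13 06:36', '2009-10-13 06:38'])
df = pd.DataFrame({'Price': [904, 904, 904.5, 905, 905.5,
                             901, 901, 901, 902, 903]}, index=idx)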

We can build the same price mask with column access, assuming the timestamps sit in a Date column rather than in the index:
onh = pd.to_datetime('06:30:00').time()
base = df['Price'].where(df['Date'].dt.time.eq(onh)).groupby(df['Date'].dt.date).transform('first')
mask_price = df['Price'].gt(base)
df_filtered = df.loc[mask_price.groupby(df['Date'].dt.date).idxmax()]
print(df_filtered)
Output
                 Date  Price
2 2009-10-12 06:34:00  904.5
8 2009-10-13 06:36:00  902.0
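If you prefer the groupby/apply shape of the original attempt (the question also asks whether the rows could come back as the values of a groupby), here is a sketch assuming every day has a 06:30 row:
onh = pd.to_datetime('06:30:00').time()
def first_break(day):
    # that day's 06:30 price
    base = day.loc[day.index.time == onh, 'Price'].iloc[0]
    hits = day[day['Price'] > base]
    return hits.iloc[:1]  # first exceeding row; empty if none
onhBreak = df.groupby(df.index.date, group_keys=False).apply(first_break)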

Related

Key Error while subsetting Timeseries data using index

I have the following Timeseries data.
price_per_year.head()
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66
price_per_year.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 782 entries, 2013-01-02 to 2015-12-31
Data columns (total 1 columns):
price 756 non-null float64
dtypes: float64(1)
memory usage: 12.2 KB
I am trying to extract data for 3 years using the code below. Why am I getting KeyError: '2014' when the data, as shown below, clearly contains the year '2014'? Appreciate any inputs.
price_per_year['2014'].head()
price
date
2014-01-01 NaN
2014-01-02 39.59
2014-01-03 40.12
2014-01-06 39.93
2014-01-07 40.92
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)
KeyError: '2014'
The line price_per_year.loc['2014', ['price']] works fine when used independently outside the for loop, while price_per_year['price'][year] doesn't work inside the loop:
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year['price'][year].reset_index(drop=True)
KeyError: 'price'
Both price_per_year.loc[price_per_year.index.year == 2014, ['price']], used independently outside the for loop, and price_per_year.loc[price_per_year.index.year == year, ['price']], used inside the for loop, give errors:
for year in ['2013', '2014', '2015']:
    price_per_year.loc[price_per_year.index.year == '2014', ['price']].reset_index(drop=True)
TypeError: Cannot convert input [False] of type <class 'bool'> to Timestamp
The problem in your first code is that the loop reassigns price_per_year to the sliced result, so after the first iteration the frame contains only 2013 data and the partial string indexing lookup for '2014' raises the KeyError. (The TypeError in the last attempt comes from comparing the integer index.year with the string year, which yields a single False instead of a boolean mask.) Slice into a new variable instead and leave the original frame intact:
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    s = price_per_year['price'][year].reset_index(drop=True).rename(year)
    prices = pd.concat([prices, s], axis=1)
print (prices)
2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
Another, better solution reshapes with cumcount and unstack:
print (df)
price
date
2013-01-02 20.08
2013-01-03 19.78
2014-01-02 19.86
2014-01-03 19.40
2015-01-02 19.66
2015-01-03 19.66
y = df.index.year
df = df.set_index([df.groupby(y).cumcount(), y])['price'].unstack()
print (df)
date 2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
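Starting again from the sample frame printed above, the same reshape can be written with pivot, if that reads more naturally; a minimal sketch:
out = (df.assign(year=df.index.year,
                 row=df.groupby(df.index.year).cumcount())
         .pivot(index='row', columns='year', values='price'))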

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")  # from datetime import datetime
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can filter out the rows on or after the 12th, then keep the last remaining row of each month with groupby:
n = 12
df = df.sort_values('Day')
# keep only rows dated before the nth of each month
sub = df[df.Day.dt.day.lt(n)]
# within each month, take the last remaining row
df_sub = sub.groupby(sub.Day.dt.strftime('%Y-%m'), sort=False).tail(1).copy()
You can try filtering the dataframe to the rows where the day is less than 12, then take the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
   .groupby([df['Day'].dt.year, df['Day'].dt.month], sort=False).last()
   .reset_index(drop=True))

         Day  Accumulative Number
0 2020-01-11                  102
1 2020-02-11                  105
2 2020-03-06                  120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# keep only days before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# forward-fill the gaps with the last known value
df2['Accumulative Number'] = df2['Accumulative Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
        Date  Accumulative Number
0 2020-01-11                102.0
1 2020-02-11                105.0
2 2020-03-11                120.0
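If you only need the as-of-the-11th readings, Series.asof can do the same lookup without materializing every calendar day; a minimal sketch, assuming df2 is the original sparse frame indexed by Day (before the reindex above):
# one timestamp for the 11th of each month in the data's range
months = pd.period_range(df2.index.min(), df2.index.max(), freq='M')
elevenths = months.to_timestamp() + pd.Timedelta(days=10)
# last known value on or before each 11th
balances = df2['Accumulative Number'].sort_index().asof(elevenths)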

Adjust the overlapping dates in group by with priority from another columns

As the title suggests, I am working on a problem of finding overlapping dates based on ID and adjusting the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates:
df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
The issue I'm now facing is how to introduce the priority (weight) and adjust Start_date by it. In my expected output (reproduced in the answer below), the adjusted dates are based on weight, where A takes precedence over B and B over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck setting up the logic here.
Dataframe:
op_d = {'ID': [1,1,1,2,2,3,3,3],'Start_date':['9/1/2020','10/10/2020','11/18/2020','4/1/2015','5/12/2016','4/1/2015','5/15/2016','8/1/2018'],\
'End_date':['10/9/2020','11/25/2020','12/31/2020','5/31/2016','12/31/2016','5/29/2016','9/25/2018','10/15/2020'],\
'Weight':['A','B','C','A','B','A','B','C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition; you can then add a day to End_date, shift it down one row, and assign that as the start date wherever the overlap column is True:
import numpy as np

arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
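Since the question asks how to bring the weights in: the np.where line above relies on rows within each ID already being ordered by priority (A, B, C), as they are in the sample data. A sketch of the prep under that assumption, using the string-to-numeric dictionary the OP suggested:
import pandas as pd
df = pd.DataFrame(data=op_d)
df[['Start_date', 'End_date']] = df[['Start_date', 'End_date']].apply(pd.to_datetime)
# map letter weights to numeric ranks: A outranks B outranks C
df['rank'] = df['Weight'].map({'A': 0, 'B': 1, 'C': 2})
df = df.sort_values(['ID', 'rank']).drop(columns='rank').reset_index(drop=True)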

How to take only the maximum date value if there are two dates in a week in a dataframe

I have a dataframe called Data:
Date Value Frequency
06/01/2020 256 A
07/01/2020 235 A
14/01/2020 85 Q
16/01/2020 625 Q
22/01/2020 125 Q
Here it is observed that 06/01/2020 and 07/01/2020 are in the same week (Monday and Tuesday).
Therefore I want to take the maximum date from each week.
My final dataframe should look like this:
Date Value Frequency
07/01/2020 235 A
16/01/2020 625 Q
22/01/2020 125 Q
I want the maximum date from each week, as shown in my final dataframe example.
I am new to Python and haven't found an answer for this yet. Please help.
First convert the column to datetimes with to_datetime. Then group by week (via Series.dt.strftime('%Y-%U')) and use DataFrameGroupBy.idxmax to find the row with the maximum datetime in each week; finally select those rows with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
print (df['Date'].dt.strftime('%Y-%U'))
0 2020-01
1 2020-01
2 2020-02
3 2020-02
4 2020-03
Name: Date, dtype: object
df = df.loc[df.groupby(df['Date'].dt.strftime('%Y-%U'))['Date'].idxmax()]
print (df)
Date Value Frequency
1 2020-01-07 235 A
3 2020-01-16 625 Q
4 2020-01-22 125 Q
If the format of the datetimes cannot be changed:
d = pd.to_datetime(df['Date'], dayfirst=True)
df = df.loc[d.groupby(d.dt.strftime('%Y-%U')).idxmax()]
print (df)
Date Value Frequency
1 07/01/2020 235 A
3 16/01/2020 625 Q
4 22/01/2020 125 Q
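Note that %U counts weeks as starting on Sunday. If ISO weeks (Monday start) are wanted instead, here is a sketch using isocalendar (pandas >= 1.1):
d = pd.to_datetime(df['Date'], dayfirst=True)
iso = d.dt.isocalendar()  # columns: year, week, day
key = iso['year'].astype(str) + '-' + iso['week'].astype(str).str.zfill(2)
df_max = df.loc[d.groupby(key).idxmax()]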

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to use the values from this lookup dataframe as multipliers to create a column of another dataframe, which has a datetime index.
That dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to data-2 depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc with the hour of the index:
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
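A third option maps the hour straight onto the lookup values without a join; a minimal sketch, assuming df_lookup's single column has been named 'multiplier' as in the join above:
# built with the identical index, so the multiplication below stays positional
# even when the datetime index contains duplicate timestamps
mult = pd.Series(df.index.hour, index=df.index).map(df_lookup['multiplier'])
df['data-3'] = df['data-2'] * mult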
