Pandas dataframe groupby by day and find first value that exceeds value at fixed time - python-3.x

I have a datetime-indexed dataframe with several years of intraday data in 2-minute increments. I want to group by day and, for each day, include the first row whose price exceeds the price at 06:30:00 that day.
df:
Price
2009-10-12 06:30:00 904
2009-10-12 06:32:00 904
2009-10-12 06:34:00 904.5
2009-10-12 06:36:00 905
2009-10-12 06:38:00 905.5
2009-10-13 06:30:00 901
2009-10-13 06:32:00 901
2009-10-13 06:34:00 901
2009-10-13 06:36:00 902
2009-10-13 06:38:00 903
I've tried using .groupby and .apply with a lambda function to group by day and include all rows that exceed the value at 06:30:00, but I get an error:
onh = pd.to_datetime('6:30:00').time()
onhBreak = df.groupby(df.index.date).apply(lambda x: x[x > x.loc[onh]])
ValueError: Can only compare identically-labeled Series objects
Desired output:
Price
2009-10-12 06:34:00 904.5
2009-10-13 06:36:00 902
(If these rows came back as the values of a groupby, that would also work.)
Any help is appreciated.

Here we need groupby with idxmax. Compare each price with its day's 06:30 price (rather than with the time), flag the exceeding rows, then take the first flagged row per day:
onh = pd.to_datetime('06:30:00').time()
# each day's 06:30 price, broadcast over that day's rows
base = df['Price'].where(df.index.time == onh).groupby(df.index.date).transform('first')
df['check'] = df['Price'] > base
# idxmax returns the first True per day
subdf = df.loc[df.groupby(df.index.date)['check'].idxmax()]
# drop days that never exceed their 06:30 price
subdf = subdf[subdf['check']]
print(subdf)
                     Price  check
2009-10-12 06:34:00  904.5   True
2009-10-13 06:36:00  902.0   True
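For reference, a minimal sketch that rebuilds the question's sample frame (values transcribed from the post), so the snippet above can be run as-is:
import pandas as pd
idx = pd.to_datetime(['2009-10-12 06:30', '2009-10-12 06:32', '2009-10-12 06:34',
                      '2009-10-12 06:36', '2009-10-12 06:38',
                      '2009-10-13 06:30', '2009-10-13 06:32', '2009-10-13 06:34',
                      '2009-10-13 06:36', '2009-10-13 06:38'])
df = pd.DataFrame({'Price': [904, 904, 904.5, 905, 905.5,
                             901, 901, 901, 902, 903]}, index=idx)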

We can build the same price mask with column access, assuming the timestamps sit in a Date column rather than in the index:
onh = pd.to_datetime('06:30:00').time()
base = df['Price'].where(df['Date'].dt.time.eq(onh)).groupby(df['Date'].dt.date).transform('first')
mask_price = df['Price'].gt(base)
df_filtered = df.loc[mask_price.groupby(df['Date'].dt.date).idxmax()]
print(df_filtered)
Output
                 Date  Price
2 2009-10-12 06:34:00  904.5
8 2009-10-13 06:36:00  902.0
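If you prefer the groupby/apply shape of the original attempt (the question also asks whether the rows could come back as the values of a groupby), here is a sketch assuming every day has a 06:30 row:
onh = pd.to_datetime('06:30:00').time()
def first_break(day):
    # that day's 06:30 price
    base = day.loc[day.index.time == onh, 'Price'].iloc[0]
    hits = day[day['Price'] > base]
    return hits.iloc[:1]  # first exceeding row; empty if none
onhBreak = df.groupby(df.index.date, group_keys=False).apply(first_break)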

Related

Key Error while subsetting Timeseries data using index

I have the following Timeseries data.
price_per_year.head()
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66
price_per_year.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 782 entries, 2013-01-02 to 2015-12-31
Data columns (total 1 columns):
price 756 non-null float64
dtypes: float64(1)
memory usage: 12.2 KB
I am trying to extract data for 3 years using the code below. Why am I getting KeyError: '2014' when the data, as shown below, clearly contains the year '2014'? Appreciate any inputs.
price_per_year['2014'].head()
price
date
2014-01-01 NaN
2014-01-02 39.59
2014-01-03 40.12
2014-01-06 39.93
2014-01-07 40.92
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)
KeyError: '2014'
The line price_per_year.loc['2014', ['price']] works fine when used independently outside the for loop, while price_per_year['price'][year] doesn't work inside the loop:
for year in ['2013', '2014', '2015']:
    price_per_year = price_per_year['price'][year].reset_index(drop=True)
KeyError: 'price'
Both price_per_year.loc[price_per_year.index.year == 2014, ['price']], used independently outside the for loop, and price_per_year.loc[price_per_year.index.year == year, ['price']], used inside the for loop, give errors:
for year in ['2013', '2014', '2015']:
    price_per_year.loc[price_per_year.index.year == '2014', ['price']].reset_index(drop=True)
TypeError: Cannot convert input [False] of type <class 'bool'> to Timestamp
The problem in your first code is that the loop reassigns price_per_year to the sliced result, so after the first iteration the frame contains only 2013 data and the partial string indexing lookup for '2014' raises the KeyError. (The TypeError in the last attempt comes from comparing the integer index.year with the string year, which yields a single False instead of a boolean mask.) Slice into a new variable instead and leave the original frame intact:
prices = pd.DataFrame()
for year in ['2013', '2014', '2015']:
    s = price_per_year['price'][year].reset_index(drop=True).rename(year)
    prices = pd.concat([prices, s], axis=1)
print (prices)
2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
Another, better solution reshapes with cumcount and unstack:
print (df)
price
date
2013-01-02 20.08
2013-01-03 19.78
2014-01-02 19.86
2014-01-03 19.40
2015-01-02 19.66
2015-01-03 19.66
y = df.index.year
df = df.set_index([df.groupby(y).cumcount(), y])['price'].unstack()
print (df)
date 2013 2014 2015
0 20.08 19.86 19.66
1 19.78 19.40 19.66
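Starting again from the sample frame printed above, the same reshape can be written with pivot, if that reads more naturally; a minimal sketch:
out = (df.assign(year=df.index.year,
                 row=df.groupby(df.index.year).cumcount())
         .pivot(index='row', columns='year', values='price'))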

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would be:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")  # from datetime import datetime
df = pd.read_csv("Accumulative.csv",quotechar="'", usecols=["Day","Accumulative Number"], index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'] )
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print (df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in Pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can filter out the rows on or after the 12th, then keep the last remaining row of each month with groupby:
n = 12
df = df.sort_values('Day')
# keep only rows dated before the nth of each month
sub = df[df.Day.dt.day.lt(n)]
# within each month, take the last remaining row
df_sub = sub.groupby(sub.Day.dt.strftime('%Y-%m'), sort=False).tail(1).copy()
You can try filtering the dataframe to the rows where the day is less than 12, then take the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
   .groupby([df['Day'].dt.year, df['Day'].dt.month], sort=False).last()
   .reset_index(drop=True))

         Day  Accumulative Number
0 2020-01-11                  102
1 2020-02-11                  105
2 2020-03-06                  120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# keep only days before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, expanding the date range:
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add dates
df2 = df2.reindex(dates)
# forward-fill the gaps with the last known value
df2['Accumulative Number'] = df2['Accumulative Number'].ffill()
# filter to get output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
        Date  Accumulative Number
0 2020-01-11                102.0
1 2020-02-11                105.0
2 2020-03-11                120.0
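If you only need the as-of-the-11th readings, Series.asof can do the same lookup without materializing every calendar day; a minimal sketch, assuming df2 is the original sparse frame indexed by Day (before the reindex above):
# one timestamp for the 11th of each month in the data's range
months = pd.period_range(df2.index.min(), df2.index.max(), freq='M')
elevenths = months.to_timestamp() + pd.Timedelta(days=10)
# last known value on or before each 11th
balances = df2['Accumulative Number'].sort_index().asof(elevenths)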

Adjust the overlapping dates in group by with priority from another columns

As the title suggests, I am working on a problem of finding overlapping dates based on ID and adjusting the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates:
df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
The issue I'm now facing is how to introduce the priority (weight) and adjust Start_date by it. In my expected output (reproduced in the answer below), the adjusted dates are based on weight, where A takes precedence over B and B over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck setting up the logic here.
Dataframe:
op_d = {'ID': [1,1,1,2,2,3,3,3],'Start_date':['9/1/2020','10/10/2020','11/18/2020','4/1/2015','5/12/2016','4/1/2015','5/15/2016','8/1/2018'],\
'End_date':['10/9/2020','11/25/2020','12/31/2020','5/31/2016','12/31/2016','5/29/2016','9/25/2018','10/15/2020'],\
'Weight':['A','B','C','A','B','A','B','C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition; you can then add a day to End_date, shift it down one row, and assign that as the start date wherever the overlap column is True:
import numpy as np

arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
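Since the question asks how to bring the weights in: the np.where line above relies on rows within each ID already being ordered by priority (A, B, C), as they are in the sample data. A sketch of the prep under that assumption, using the string-to-numeric dictionary the OP suggested:
import pandas as pd
df = pd.DataFrame(data=op_d)
df[['Start_date', 'End_date']] = df[['Start_date', 'End_date']].apply(pd.to_datetime)
# map letter weights to numeric ranks: A outranks B outranks C
df['rank'] = df['Weight'].map({'A': 0, 'B': 1, 'C': 2})
df = df.sort_values(['ID', 'rank']).drop(columns='rank').reset_index(drop=True)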

How to take only the maximum date value if there are two dates in a week in a dataframe

I have a dataframe called Data:
Date Value Frequency
06/01/2020 256 A
07/01/2020 235 A
14/01/2020 85 Q
16/01/2020 625 Q
22/01/2020 125 Q
Here it is observed that 06/01/2020 and 07/01/2020 are in the same week (Monday and Tuesday).
Therefore I want to take the maximum date from each week.
My final dataframe should look like this:
Date Value Frequency
07/01/2020 235 A
16/01/2020 625 Q
22/01/2020 125 Q
I want the maximum date from each week, as shown in my final dataframe example.
I am new to Python and haven't found an answer for this yet. Please help.
First convert the column to datetimes with to_datetime. Then group by week (via Series.dt.strftime('%Y-%U')) and use DataFrameGroupBy.idxmax to find the row with the maximum datetime in each week; finally select those rows with DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
print (df['Date'].dt.strftime('%Y-%U'))
0 2020-01
1 2020-01
2 2020-02
3 2020-02
4 2020-03
Name: Date, dtype: object
df = df.loc[df.groupby(df['Date'].dt.strftime('%Y-%U'))['Date'].idxmax()]
print (df)
Date Value Frequency
1 2020-01-07 235 A
3 2020-01-16 625 Q
4 2020-01-22 125 Q
If the format of the datetimes cannot be changed:
d = pd.to_datetime(df['Date'], dayfirst=True)
df = df.loc[d.groupby(d.dt.strftime('%Y-%U')).idxmax()]
print (df)
Date Value Frequency
1 07/01/2020 235 A
3 16/01/2020 625 Q
4 22/01/2020 125 Q
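Note that %U counts weeks as starting on Sunday. If ISO weeks (Monday start) are wanted instead, here is a sketch using isocalendar (pandas >= 1.1):
d = pd.to_datetime(df['Date'], dayfirst=True)
iso = d.dt.isocalendar()  # columns: year, week, day
key = iso['year'].astype(str) + '-' + iso['week'].astype(str).str.zfill(2)
df_max = df.loc[d.groupby(key).idxmax()]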

Create a pandas column based on a lookup value from another dataframe

I have a pandas dataframe that has some data values by hour (which is also the index of this lookup dataframe). The dataframe looks like this:
In [1] print (df_lookup)
Out[1] 0 1.109248
1 1.102435
2 1.085014
3 1.073487
4 1.079385
5 1.088759
6 1.044708
7 0.902482
8 0.852348
9 0.995912
10 1.031643
11 1.023458
12 1.006961
...
23 0.889541
I want to use the values from this lookup dataframe as multipliers to create a column of another dataframe, which has a datetime index.
That dataframe looks like this:
In [2] print (df)
Out[2]
Date_Label ID data-1 data-2 data-3
2015-08-09 00:00:00 1 2513.0 2502 NaN
2015-08-09 00:00:00 1 2113.0 2102 NaN
2015-08-09 01:00:00 2 2006.0 1988 NaN
2015-08-09 02:00:00 3 2016.0 2003 NaN
...
2018-07-19 23:00:00 33 3216.0 333 NaN
I want to calculate the data-3 column from the data-2 column, where the weight given to data-2 depends on the corresponding value in df_lookup. I get the desired values by looping over the index as follows, but that is too slow:
for idx in df.index:
    df.loc[idx, 'data-3'] = df.loc[idx, 'data-2'] * df_lookup.at[idx.hour]
Is there a faster way someone could suggest?
Using .loc with the hour of the index:
df['data-2']*df_lookup.loc[df.index.hour].values
Out[275]:
Date_Label
2015-08-09 00:00:00 2775.338496
2015-08-09 00:00:00 2331.639296
2015-08-09 01:00:00 2191.640780
2015-08-09 02:00:00 2173.283042
Name: data-2, dtype: float64
df['data-3'] = df['data-2'] * df_lookup.loc[df.index.hour].values
I'd probably try doing a join.
# Fix column name
df_lookup.columns = ['multiplier']
# Get hour index
df['hour'] = df.index.hour
# Join
df = df.join(df_lookup, how='left', on=['hour'])
df['data-3'] = df['data-2'] * df['multiplier']
df = df.drop(['multiplier', 'hour'], axis=1)
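A third option maps the hour straight onto the lookup values without a join; a minimal sketch, assuming df_lookup's single column has been named 'multiplier' as in the join above:
# built with the identical index, so the multiplication below stays positional
# even when the datetime index contains duplicate timestamps
mult = pd.Series(df.index.hour, index=df.index).map(df_lookup['multiplier'])
df['data-3'] = df['data-2'] * mult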
