How to use the condition with other rows (previous moments in time series data), in pandas, python3

I have a pandas DataFrame df. It holds time series data, with 1000 rows and 3 columns. What I want is given in the pseudo-code below.
for each row:
    if the value in column 'colA' at [this_row - 1] is higher than
    the value in column 'colB' at [this_row - 2] by more than 3%,
    then set the value in 'colCheck' at [this_row] to True.
Finally, pick out all the rows in df where 'colCheck' is True.
I will use the following example to further illustrate what I mean.
df =
            colA  colB  colCheck
Dates
2017-01-01    20    30       NaN
2017-01-02    10    40       NaN
2017-01-03    50    20     False
2017-01-04    40    10      True
First, when this_row = 2 (the 3rd row, where the date is 2017-01-03), the value in colA at [this_row-1] is 10 and the value in colB at [this_row-2] is 30. Since (10-30)/30 = -67% < 3%, the value in colCheck at [this_row] is False.
Likewise, when this_row = 3, (50-40)/40 = 25% > 3%, so the value in colCheck at [this_row] is True.
Last but not least, the first two rows in colCheck should be NaN, since the calculation needs to access [this_row-2] in colB, which does not exist for the first two rows.
Besides, the criteria of 3%, [row-1] in colA, and [row-2] in colB are just examples; in my real project they are situational, e.g. 4% and [row-3].
I am looking for a concise and elegant approach. I am using Python 3.
Thanks.

You can rearrange the maths and use pd.Series.shift. The original condition (colA[t-1] - colB[t-2]) / colB[t-2] > 0.03 is equivalent to colA[t-1] / colB[t-2] > 1.03 (assuming positive values):
df.colA.shift(1).div(df.colB.shift(2)).gt(1.03)
Dates
2017-01-01 False
2017-01-02 False
2017-01-03 False
2017-01-04 True
dtype: bool
Using pd.DataFrame.assign we can create a copy with the new column
df.assign(colCheck=df.colA.shift(1).div(df.colB.shift(2)).gt(1.03))
colA colB colCheck
Dates
2017-01-01 20 30 False
2017-01-02 10 40 False
2017-01-03 50 20 False
2017-01-04 40 10 True
If you insisted on leaving the first two as NaN, you could use iloc; assign aligns on the index, so the two dropped rows come back as NaN:
df.assign(colCheck=df.colA.shift(1).div(df.colB.shift(2)).gt(1.03).iloc[2:])
colA colB colCheck
Dates
2017-01-01 20 30 NaN
2017-01-02 10 40 NaN
2017-01-03 50 20 False
2017-01-04 40 10 True
And for maximum clarity:
# This creates a boolean array of when your conditions are met
colCheck = (df.colA.shift(1) / df.colB.shift(2)) > 1.03
# This chops off the first two `False` values, creates a new
# column named `colCheck`, and assigns to it the boolean values
# calculated just above.
df.assign(colCheck=colCheck.iloc[2:])
colA colB colCheck
Dates
2017-01-01 20 30 NaN
2017-01-02 10 40 NaN
2017-01-03 50 20 False
2017-01-04 40 10 True
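Since the threshold and row offsets are situational, you can wrap the same one-liner in a small helper. A minimal sketch building on the df above (the name col_check and its parameters are mine, not part of the original answer):
def col_check(df, num_col, den_col, num_shift, den_shift, pct):
    # True where num_col, shifted num_shift rows back, exceeds
    # den_col, shifted den_shift rows back, by more than pct
    check = df[num_col].shift(num_shift).div(df[den_col].shift(den_shift)).gt(1 + pct)
    # Drop the leading rows that have nothing to look back on,
    # so that assign leaves them as NaN
    return check.iloc[max(num_shift, den_shift):]

# The 4% and [row-3] variant mentioned in the question:
df = df.assign(colCheck=col_check(df, 'colA', 'colB', 1, 3, 0.04))
df[df['colCheck'].eq(True)]  # pick out the rows where colCheck is True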

Related

How to reformat time series to fill in missing entries with NaNs?

I have a problem that involves converting time series from one
representation to another. Each item in the time series has
attributes "time", "id", and "value" (think of it as a measurement
at "time" for sensor "id"). I'm storing all the items in a
Pandas dataframe with columns named by the attributes.
The set of "time"s is a small set of integers (say, 32),
but some of the "id"s are missing "time"s/"value"s. What I want to
construct is an output dataframe with the form:
id  time0  time1  ...  timeN
    val0   val1   ...  valN
where the missing "value"s are represented by NaNs.
For example, suppose the input looks like the following:
time id value
0 0 13
2 0 15
3 0 20
2 1 10
3 1 12
Then, assuming the set of possible times is 0, 2, and 3, the
desired output is:
id time0 time1 time2 time3
0 13 NaN 15 20
1 NaN NaN 10 12
I'm looking for a Pythonic way to do this since there are several
million rows in the input and around 1/4 million groups.
You can transform your table with a pivot. If you need to handle duplicate values for index/column pairs, you can use the more general pivot_table.
For your example, the simple pivot is sufficient:
>>> df = df.pivot(index="id", columns="time", values="value")
time 0 2 3
id
0 13.0 15.0 20.0
1 NaN 10.0 12.0
To get the exact result from your question, you could reindex the columns to fill in the empty values, and rename the column index like this:
# add missing time columns, fill with NaNs
df = df.reindex(range(df.columns.max() + 1), axis=1)
# name them "time#"
df.columns = "time" + df.columns.astype(str)
# remove the column index name "time"
df = df.rename_axis(None, axis=1)
Final df:
time0 time1 time2 time3
id
0 13.0 NaN 15.0 20.0
1 NaN NaN 10.0 12.0
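As noted above, the simple pivot raises an error if the same (id, time) pair occurs more than once; in that case pivot_table with an explicit aggregation is the drop-in replacement. A sketch (aggfunc="mean" is an assumption here; pick whichever aggregation suits the data):
# aggfunc is an assumption; "first", "sum", etc. also work
>>> df = df.pivot_table(index="id", columns="time", values="value", aggfunc="mean")
The reindex and rename steps above then apply unchanged.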

Pandas : Finding correct time window

I have a pandas dataframe which gets updated every hour with latest hourly data. I have to filter out IDs based upon a threshold, i.e. PR_Rate > 50 and CNT_12571 < 30 for 3 consecutive hours from a lookback period of 5 hours. I was using the below statements to accomplish this:
df_thld = df[(df['Date'] > df['Date'].max() - pd.Timedelta(hours=5))
             & (df.PR_Rate > 50) & (df.CNT_12571 < 30)]
df_thld['HR_CNT'] = df_thld.groupby('ID')['Date'].transform('nunique')
df_thld[df_thld['HR_CNT'] >= 3]
The problem with this approach is that, since the lookback period is 5 hours, HR_CNT can count non-consecutive hours that breach the criteria.
My dataset is as below:
Date IDs CT_12571 PR_Rate
16/06/2021 10:00 A1 15 50.487
16/06/2021 11:00 A1 31 40.806
16/06/2021 12:00 A1 25 52.302
16/06/2021 13:00 A1 13 61.45
16/06/2021 14:00 A1 7 73.805
In the above dataframe the threshold was not breached at 11:00, but my code counts 10:00, 12:00 and 13:00 as the hours that breached the threshold, instead of 12:00, 13:00 and 14:00 as required. Each ID may or may not breach this criteria on a given day. Any idea how I can fix this issue?
Please excuse me if I have misinterpreted your problem. As I understand the issue, you have a dataframe which is updated hourly; an example of this dataframe is illustrated below as df. From this dataframe, you want to filter only those rows which satisfy the following two conditions:
PR_Rate > 50 and CNT_12571 < 30
If and only if the threshold is surpassed for three consecutive hours
Given these assumptions, I would proceed as follows:
df:
Date IDs CT_1257 PR_Rate
0 2021-06-16 10:00:00 A1 15 50.487
1 2021-06-16 12:00:00 A1 31 40.806
2 2021-06-16 14:00:00 A1 25 52.302
3 2021-06-16 15:00:00 A1 13 61.450
4 2021-06-16 16:00:00 A1 7 73.805
Note that in this dataframe, the only time frame which satisfies the above conditions is the entries for 14:00, 15:00 and 16:00.
def filterFrame(df, dur, pr_threshold, ct_threshold):
    # Keep only the rows that breach both thresholds
    ff = df[(df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)].reset_index()
    # For each row, count how many breaching rows fall inside the
    # trailing `dur`-hour window ending at that row
    ml = list(ff.rolling(f'{dur}h', on='Date').count()['IDs'])
    r = len(ml) - 1
    rows = []
    # Walk backwards; whenever a window holds `dur` rows, collect
    # the positions of all rows in that window
    while r >= 0:
        if int(ml[r]) < dur:
            r -= 1
        else:
            k = int(ml[r])
            for i in range(k):
                rows.append(r - i)
            r -= k
    rows = rows[::-1]
    return ff.filter(items=rows, axis=0).reset_index()
Running filterFrame(df, 3, 50, 30) yields:
level_0 index Date IDs CT_1257 PR_Rate
0 1 2 2021-06-16 14:00:00 A1 25 52.302
1 2 3 2021-06-16 15:00:00 A1 13 61.450
2 3 4 2021-06-16 16:00:00 A1 7 73.805
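A loop-free alternative is to label runs of consecutive breaching rows per ID and keep rows whose run is at least dur long. This is only a sketch, and it assumes the frame is sorted by Date and sampled hourly with no gaps (so that consecutive rows really are consecutive hours):
def filter_consecutive(df, dur=3, pr_threshold=50, ct_threshold=30):
    df = df.sort_values(['IDs', 'Date'])
    breach = (df['CT_1257'] < ct_threshold) & (df['PR_Rate'] > pr_threshold)

    def runs(s):
        # Label each run of identical True/False values, then keep
        # rows whose breach-run spans at least `dur` rows
        grp = s.ne(s.shift()).cumsum()
        return s & s.groupby(grp).transform('size').ge(dur)

    return df[breach.groupby(df['IDs'], group_keys=False).apply(runs)]
For the example df above, filter_consecutive(df, 3, 50, 30) keeps the same 14:00, 15:00 and 16:00 rows.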

Find Matching rows in the data frame by comparing all rows based on certain conditions

I'm fairly new to python and would appreciate if someone can guide me in the right direction.
I have a dataset that has unique trades in each row. I need to find all rows that match on certain conditions. Basically, find any offsetting trades that fit a certain condition. For example:
Find trades that have the same REF_RATE, RECEIVE within a difference of 5, and MATURITY_DATE within 7 days of each other. I have attached an image of the data.
Thank You.
You can use groupby to achieve this. For your specific requirement (same REF_RATE, RECEIVE within a difference of 5, MATURITY_DATE within 7 days of each other), you can proceed like this:
#sample data created from the image of your dataset
>>> data = {'Maturity_Date':['2/01/2021','10/01/2021','10/01/2021','6/06/2021'],'Trade_id':['10484','12880','11798','19561'],'REF_RATE':['BBSW','BBSW','OIS','BBSW'],'Recive':[1.5,1.25,2,10]}
>>> df = pd.DataFrame(data)
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2/01/2021 10484 BBSW 1.50
1 10/01/2021 12880 BBSW 1.25
2 10/01/2021 11798 OIS 2.00
3 6/06/2021 19561 BBSW 10.00
#convert Maturity_Date to datetime format
>>> df['Maturity_Date'] = pd.to_datetime(df['Maturity_Date'], dayfirst=True)
#sort by date within each REF_RATE if needed (this example is already sorted)
#df = df.sort_values(['REF_RATE', 'Maturity_Date'])
>>> df
Maturity_Date Trade_id REF_RATE Recive
0 2021-01-02 10484 BBSW 1.50
1 2021-01-10 12880 BBSW 1.25
2 2021-01-10 11798 OIS 2.00
3 2021-06-06 19561 BBSW 10.00
#groupby REF_RATE and apply the conditions on the date and receive columns
>>> import numpy as np
>>> df['date_diff>7'] = df.groupby('REF_RATE')['Maturity_Date'].diff() / np.timedelta64(1, 'D') > 7
>>> df['rate_diff>5'] = df.groupby('REF_RATE')['Recive'].diff() > 5
>>> df
Maturity_Date Trade_id REF_RATE Recive date_diff>7 rate_diff>5
0 2021-01-02 10484 BBSW 1.50 False False
1 2021-01-10 12880 BBSW 1.25 True False #date_diff True: gap to the previous BBSW trade exceeds 7 days
2 2021-01-10 11798 OIS 2.00 False False
3 2021-06-06 19561 BBSW 10.00 True True #both True: date gap > 7 days and Recive difference > 5
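Note these flags mark pairs that exceed the thresholds; if you instead want trades that fall within them, as the question is worded, here is a sketch using absolute differences. It compares each trade only to the nearest preceding trade in its REF_RATE group; a full pairwise match would need a self-merge:
>>> g = df.sort_values('Maturity_Date').groupby('REF_RATE')
>>> df['date_within_7'] = g['Maturity_Date'].diff().abs().le(pd.Timedelta(days=7))
>>> df['rate_within_5'] = g['Recive'].diff().abs().le(5)
>>> df[df['date_within_7'] & df['rate_within_5']]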

Boolean indexing in pandas dataframes

I'm trying to apply boolean indexing to a pandas DataFrame.
nm - stores the names of players
ag - stores the player ages
sc - stores the scores
capt - stores boolean index values
import pandas as pd
nm=pd.Series(['p1','p2', 'p3', 'p4'])
ag=pd.Series([12,17,14, 19])
sc=pd.Series([120, 130, 150, 100])
capt=[True, False, True, True]
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc}, index=capt)
print(Cricket)
Output:
Name Age Score
True NaN NaN NaN
False NaN NaN NaN
True NaN NaN NaN
True NaN NaN NaN
Whenever I run the code above, I get a DataFrame filled with NaN values. The only case in which this seems to work is when capt doesn't have repeating elements.
i.e. when capt=[False, True] (and reasonable values are given for nm, ag and sc) this code works as expected.
I'm running Python 3.8.5 with pandas 1.1.1. Is this a deprecated functionality?
Desired output:
Name Age Score
True p1 12 120
False p2 17 130
True p3 14 150
True p4 19 100
Set index values for each Series to avoid a mismatch between the default RangeIndex of each Series and the new index values from capt:
capt=[True, False, True, True]
nm=pd.Series(['p1','p2', 'p3', 'p4'], index=capt)
ag=pd.Series([12,17,14, 19], index=capt)
sc=pd.Series([120, 130, 150, 100], index=capt)
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc})
print(Cricket)
Name Age Score
True p1 12 120
False p2 17 130
True p3 14 150
True p4 19 100
Detail:
print(pd.Series(['p1','p2', 'p3', 'p4']))
0 p1
1 p2
2 p3
3 p4
dtype: object
print(pd.Series(['p1','p2', 'p3', 'p4'], index=capt))
True p1
False p2
True p3
True p4
dtype: object
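Alternatively, build the DataFrame from plain lists. A minimal sketch: lists carry no index of their own, so the constructor simply attaches capt as the row labels instead of reindexing.
capt=[True, False, True, True]
# Plain lists have no index, so `capt` is applied directly
Cricket=pd.DataFrame({"Name":['p1','p2', 'p3', 'p4'],
                      "Age":[12,17,14, 19],
                      "Score":[120, 130, 150, 100]}, index=capt)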
Boolean indexing, by contrast, is filtering:
capt=[True, False, True, True]
nm=pd.Series(['p1','p2', 'p3', 'p4'])
ag=pd.Series([12,17,14, 19])
sc=pd.Series([120, 130, 150, 100])
Cricket=pd.DataFrame({"Name":nm,"Age":ag ,"Score":sc})
print(Cricket)
Name Age Score
0 p1 12 120
1 p2 17 130
2 p3 14 150
3 p4 19 100
print (Cricket[capt])
Name Age Score
0 p1 12 120
2 p3 14 150
3 p4 19 100

Adjust overlapping dates in a groupby, with priority from another column

As the title suggests, I am working on a problem of finding overlapping dates based on ID and adjusting the overlapping date based on priority (weight). The following piece of code helped to find the overlapping dates:
from datetime import timedelta

df['overlap'] = (df.groupby('ID')
                   .apply(lambda x: (x['End_date'].shift() - x['Start_date']) > timedelta(0))
                   .reset_index(level=0, drop=True))
df
Now the issue I'm facing is how to introduce the priority (weight) and adjust Start_date by it. In the below image, I have highlighted the adjusted dates based on weight, where A takes precedence over B and B takes precedence over C.
Should I create a dictionary mapping the string weights to numeric values, and then what? I'm stuck setting up the logic.
Dataframe:
op_d = {'ID': [1,1,1,2,2,3,3,3],'Start_date':['9/1/2020','10/10/2020','11/18/2020','4/1/2015','5/12/2016','4/1/2015','5/15/2016','8/1/2018'],\
'End_date':['10/9/2020','11/25/2020','12/31/2020','5/31/2016','12/31/2016','5/29/2016','9/25/2018','10/15/2020'],\
'Weight':['A','B','C','A','B','A','B','C']}
df = pd.DataFrame(data=op_d)
You have already identified the overlap condition; you can then add a day to End_date, shift, and assign the result to the start date where the overlap column is True:
import numpy as np
# Start_date / End_date must be datetimes for the Timedelta arithmetic
df[['Start_date', 'End_date']] = df[['Start_date', 'End_date']].apply(pd.to_datetime)
arr = np.where(df['overlap'],
               df['End_date'].add(pd.Timedelta(1, unit='d')).shift(),
               df['Start_date'])
out = df.assign(Output_Start_Date=arr, Output_End_Date=df['End_date'])
print(out)
ID Start_date End_date Weight overlap Output_Start_Date Output_End_Date
0 1 2020-09-01 2020-10-09 A False 2020-09-01 2020-10-09
1 1 2020-10-10 2020-11-25 B False 2020-10-10 2020-11-25
2 1 2020-11-18 2020-12-31 C True 2020-11-26 2020-12-31
3 2 2015-04-01 2016-05-31 A False 2015-04-01 2016-05-31
4 2 2016-05-12 2016-12-31 B True 2016-06-01 2016-12-31
5 3 2015-04-01 2016-05-29 A False 2015-04-01 2016-05-29
6 3 2016-05-15 2018-09-25 B True 2016-05-30 2018-09-25
7 3 2018-08-01 2020-10-15 C True 2018-09-26 2020-10-15
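One caveat: End_date.add(...).shift() shifts across the whole frame, so a date from the previous ID leaks into each group's first row; it is discarded only because that row's overlap is always False. A per-group sketch that avoids relying on this:
# Shift End_date within each ID so nothing leaks across groups
prev_end = df.groupby('ID')['End_date'].shift() + pd.Timedelta(days=1)
out = df.assign(Output_Start_Date=prev_end.where(df['overlap'], df['Start_date']),
                Output_End_Date=df['End_date'])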
