I have a dataframe with a time series of scores. My goal is to detect when the score is larger than a certain threshold th and then to find when the score goes back to 0. It is quite easy to find each condition separately:
dates_1 = score > th
dates_2 = np.sign(score[1:]) == np.sign(score.shift(1).dropna())
However, I don't know the most Pythonic way to restrict dates_2 so that it only keeps dates that follow an 'active' dates_1 observation.
Perhaps I could use an auxiliary column 'active' that is set to True whenever score > th and reset to False when the condition for dates_2 is met. That way I can ask for the change in sign AND active == True. However, that approach requires iteration, and I'm wondering if there's a vectorized solution to my problem.
Any thoughts on how to improve my approach?
Sample data:
date score
2010-01-04 0.0
2010-01-05 -0.3667779798467592
2010-01-06 -1.9641427199568868
2010-01-07 -0.49976215445519134
2010-01-08 -0.7069108074548405
2010-01-11 -1.4624766212523337
2010-01-12 -0.9132777669357441
2010-01-13 0.16204588193577152
2010-01-14 0.958085568609925
2010-01-15 1.4683022129399834
2010-01-19 3.036016680985081
2010-01-20 2.2357911432637345
2010-01-21 2.8827438241030707
2010-01-22 -3.395977874791837
Expected Output
if th = 0.94
date active
2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 False
2010-01-11 False
2010-01-12 False
2010-01-13 False
2010-01-14 True
2010-01-15 True
2010-01-19 True
2010-01-20 True
2010-01-21 True
2010-01-22 False
Not Vectorized!
def alt_cond(s, th):
    active = False
    for x in s:
        active = [x >= th, x > 0][int(active)]
        yield active

df.assign(A=[*alt_cond(df.score, 0.94)])
date score A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
Vectorized (Sort Of)
I used Numba to really speed things up. It is still a loop, but it should be very fast if you can install numba.
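import numpy as np
from numba import njit

@njit
def alt_cond(s, th):
    active = False
    out = np.zeros(len(s), dtype=np.bool_)
    for i, x in enumerate(s):
        if active:
            # While active, switch off once the score is back at or below 0
            if x <= 0:
                active = False
        else:
            # While inactive, switch on once the score reaches the threshold
            if x >= th:
                active = True
        out[i] = active
    return out

df.assign(A=alt_cond(df.score.values, .94))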
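For a version without an explicit loop, one possible sketch (not part of the answer above) treats the problem as a latch: mark every row that switches the state on or off, leave the rest as NaN, and forward-fill. It assumes th > 0, in which case it should agree with the loop; `latch_active` is a hypothetical helper name, and the scores are the sample data rounded to two decimals.

```python
import numpy as np
import pandas as pd

def latch_active(score: pd.Series, th: float) -> pd.Series:
    """Latch on when score >= th, off when score <= 0 (assumes th > 0)."""
    # 1.0 = switch on, 0.0 = switch off, NaN = keep the previous state
    signal = np.select(
        [(score >= th).to_numpy(), (score <= 0).to_numpy()],
        [1.0, 0.0],
        default=np.nan,
    )
    state = pd.Series(signal, index=score.index)
    # Forward-fill carries the last switch; before any switch the state is off
    return state.ffill().fillna(0.0).astype(bool)

# Sample scores from the question, rounded to two decimals
score = pd.Series([0.0, -0.37, -1.96, -0.50, -0.71, -1.46, -0.91,
                   0.16, 0.96, 1.47, 3.04, 2.24, 2.88, -3.40])
active = latch_active(score, 0.94)
print(active.tolist())
```

Values strictly between 0 and th never change the state, which is why they can be left as NaN and filled from the previous row.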
Response to Comment
You can have a dictionary of column names and threshold values and iterate
th = {'score': 0.94}
df.join(pd.DataFrame(
    np.column_stack([[*alt_cond(df[k], v)] for k, v in th.items()]),
    df.index, [f"{k}_A" for k in th]
))
date score score_A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
I'm assuming your data is in a pandas dataframe and 'date' is your index column. Then this is how I'd do it:
th = 0.94  # Threshold value
i = df[df.score > th].index[0]  # Index of the first row that meets the first condition
after = df.loc[i:]  # Only look from the first condition onward
after[after.score < 0].index[0]  # Index of the first row after i that meets the second condition
So use conditional indexing to find the index for the first condition ([df.score > th]), then check for the second condition ([after.score < 0]), but only in the rows starting at the index found for the first condition (df.loc[i:]). Building the mask from the sliced frame keeps the boolean mask aligned with the rows it filters.
Related
I want to check whether the values in both datasets are equal. The datasets are not in the same order, so I think I need to loop through them.
Dataset 1 (contract):
Part number  H50  H51  H53
ID001        1    1    1
ID002        1    1    1
ID003        0    1    0
ID004        1    1    1
ID005        1    1    1
Dataset 2 (anx):
The part numbers are not in the same order, so rows have to be matched on part number first. If the part number is the same, the H column (header) has to match as well. If both the part number and the H column match, check whether the value is the same.
Part number  H50  H51  H53
ID001        1    1    1
ID003        0    0    1
ID004        0    1    1
ID002        1    0    1
ID005        1    1    1
Expected outcome:
If the value is the same in both datasets (1 == 1 or 0 == 0) -> change it to TRUE.
If the value is 1 in dataset 1 but 0 in dataset 2 -> change the value to FALSE, and save all rows that contain a FALSE value to an Excel file named "Not in contract".
If the value is 0 in dataset 1 but 1 in dataset 2 -> change the value to FALSE.
Example expected outcome
Part number  H50    H51    H53
ID001        TRUE   TRUE   TRUE
ID002        TRUE   FALSE  TRUE
ID003        TRUE   FALSE  FALSE
ID004        FALSE  TRUE   TRUE
ID005        TRUE   TRUE   TRUE
df_merged = df1.merge(df2, on='Part number')
a = df_merged[df_merged.columns[df_merged.columns.str.contains('_x')]]
b = df_merged[df_merged.columns[df_merged.columns.str.contains('_y')]]
out = pd.concat([df_merged['Part number'], pd.DataFrame(a.values == b.values, columns=df1.columns[1:4])], axis=1)
out
Part number H50 H51 H53
0 ID001 True True True
1 ID002 True False True
2 ID003 True False False
3 ID004 False True True
4 ID005 True True True
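As an alternative sketch of the same comparison, the part number can be used as the index so that pandas aligns the rows itself; this also makes it easy to pick out the rows containing a FALSE, which the question wants saved to Excel. The frames below are rebuilt from the tables in the question, and the file name and the commented-out `to_excel` call are assumptions (writing .xlsx needs an engine such as openpyxl).

```python
import pandas as pd

df1 = pd.DataFrame({
    "Part number": ["ID001", "ID002", "ID003", "ID004", "ID005"],
    "H50": [1, 1, 0, 1, 1],
    "H51": [1, 1, 1, 1, 1],
    "H53": [1, 1, 0, 1, 1],
})
df2 = pd.DataFrame({
    "Part number": ["ID001", "ID003", "ID004", "ID002", "ID005"],
    "H50": [1, 0, 0, 1, 1],
    "H51": [1, 0, 1, 0, 1],
    "H53": [1, 1, 1, 1, 1],
})

# Index both frames by part number and put df2 in df1's row order
a = df1.set_index("Part number")
b = df2.set_index("Part number").reindex(a.index)

same = a.eq(b)                        # True where the two values agree
mismatched = same[~same.all(axis=1)]  # rows with at least one False
print(mismatched)

# Assumed file name; uncomment if an Excel writer engine is installed
# mismatched.to_excel("Not in contract.xlsx")
```

Because the comparison is label-based, the H columns are matched by name as well, so no `_x`/`_y` suffix handling is needed.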
I need to compare two pandas dataframes with an unequal number of rows and generate a new df with True for matching records and False for non-matching and missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
df1 = pd.DataFrame({'date':['2022-11-01', '2022-11-02', '2022-11-03'],'x':[4,12,11],'y':[5,5,3]})
df2 = pd.DataFrame({'date':['2022-11-01', '2022-11-02'],'x':[4,11],'y':[5,5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects
You can use:
# cell to cell equality
# comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# if you also want to turn a row to False if there is any False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
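Put together with the data from the question, the date-based comparison could look like the following sketch. The intermediate names `aligned`, `cell_equal` and `row_ok` are only illustrative, and the last step uses a per-column `&` as an equivalent way of turning a whole row False when any cell differs.

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'],
                    'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'],
                    'x': [4, 11], 'y': [5, 5]})

# Left-merge on date so df2 gets one row per df1 row (NaN where missing)
aligned = df1[['date']].merge(df2, on='date', how='left')

cell_equal = df1.eq(aligned)     # cell-by-cell comparison, NaN never matches
row_ok = cell_equal.all(axis=1)  # True only if the whole row matches

# Force every cell of a row to False unless the entire row matched
df_output = cell_equal.apply(lambda col: col & row_ok)
print(df_output)
```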
I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to know the length of each run of successive True or False values, not the overall total: the count must stop whenever the value toggles between True and False. Eventually I want to calculate the mean(), max() and min() run length in days. Is it possible to show this data in pandas?
Solution if all datetimes are consecutive:
You can create a helper Series that labels the consecutive groups with Series.shift and Series.cumsum, then get the counts with GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False True_False
False 2 3
4 1
True 1 2
3 4
dtype: int64
Finally, aggregate min, max and mean per first level of the MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2
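To see the shift/cumsum trick end to end, here is a small self-contained sketch. The ten-day frame is made-up illustration data (chosen so the runs are 2, 3, 4 and 1 days long), and the helper Series is renamed to `run` purely to keep the two grouping keys distinct.

```python
import pandas as pd

# Hypothetical data: runs of True(2), False(3), True(4), False(1)
idx = pd.date_range("2018-01-01", periods=10, freq="D")
df = pd.DataFrame(
    {"True_False": [True, True, False, False, False,
                    True, True, True, True, False]},
    index=idx,
)

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points gives each run its own id
run = df["True_False"].ne(df["True_False"].shift()).cumsum().rename("run")

# Length of every run, keyed by (value, run id)
s = df.groupby(["True_False", run]).size()

# Summary of run lengths per value
stats = s.groupby(level=0).agg(["mean", "max", "min"])
print(stats)
```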
I have a DataFrame with two columns: one column is date and the other column contains values True or False.
Assume this code to get the Dataframe:
d_range=pd.date_range(start='01-01-2018', end='01-06-2018', freq='0.2D', )
d_range=d_range.date
my_list=[]
for i in range(0,d_range.size):
    if 0<i<18:
        my_list.append(False)
    else:
        my_list.append(True)
df=pd.DataFrame({'date':d_range, 'met criteria':my_list})
df.set_index(['date'])
This will give us this DataFrame:
print(df)
date criteria
0 2018-01-01 True
1 2018-01-01 False
2 2018-01-01 False
3 2018-01-01 False
4 2018-01-01 False
5 2018-01-02 False
6 2018-01-02 False
7 2018-01-02 False
8 2018-01-02 False
9 2018-01-02 False
10 2018-01-03 False
11 2018-01-03 False
12 2018-01-03 False
13 2018-01-03 False
14 2018-01-03 False
15 2018-01-04 False
16 2018-01-04 False
17 2018-01-04 False
18 2018-01-04 True
19 2018-01-04 True
20 2018-01-05 True
21 2018-01-05 True
22 2018-01-05 True
23 2018-01-05 True
24 2018-01-05 True
25 2018-01-06 True
I need an outcome that will group by 'date' and if there is at least one True value then the result will be True, otherwise it will be False.
The outcome should look like:
date criteria
2018-01-01 True
2018-01-02 False
2018-01-03 False
2018-01-04 True
2018-01-05 True
2018-01-06 True
Can you suggest some code that will do that, please?
Here is a way to do it:
In [1]:
import pandas as pd
d_range=pd.date_range(start='01-01-2018', end='01-06-2018', freq='0.2D', )
d_range=d_range.date
my_list=[]
for i in range(0,d_range.size):
    if 0<i<18:
        my_list.append(False)
    else:
        my_list.append(True)
df=pd.DataFrame({'date':d_range, 'met criteria':my_list})
def True_or_Not(x):
    # The per-date sum counts the True values, so > 0 means at least one True
    return x>0
df = df.groupby('date').sum().apply(True_or_Not)
df
Out [1]:
met criteria
date
2018-01-01 True
2018-01-02 False
2018-01-03 False
2018-01-04 True
2018-01-05 True
2018-01-06 True
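The same "at least one True per date" rule can also be written with GroupBy.any, which skips the sum-then-compare step. The sketch below rebuilds an equivalent frame by hand (five rows per day, same True/False pattern as in the question) rather than relying on the fractional-day frequency string, so the construction is an assumption about the data, not the original code.

```python
from datetime import date, timedelta

import pandas as pd

# 26 rows, five per calendar day, False for rows 1..17 and True otherwise
dates = [date(2018, 1, 1) + timedelta(days=i // 5) for i in range(26)]
flags = [not (0 < i < 18) for i in range(26)]
df = pd.DataFrame({"date": dates, "met criteria": flags})

# any() is True for a date as soon as one of its rows is True
result = df.groupby("date")["met criteria"].any()
print(result)
```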
You can use the isin method for this. Basically, filter the dataframe by each unique value in your date column, then check whether each resulting dataframe contains True in the criteria column.
Populate a dictionary based on the result, then create a new dataframe whose first column holds your dates and whose second column holds the bool values mapped from your_dict:
date_unique = sorted(set(df['date'].values.tolist()))
your_dict = {}
for d in date_unique:
    test_df = df[df['date'].isin([d])]
    # .any() tests the values; `in` on a Series would test the index labels instead
    your_dict[d] = bool(test_df['criteria'].any())
output_df = pd.DataFrame()
output_df['date'] = date_unique
output_df['criteria'] = output_df['date'].map(your_dict)
Please note that 'True' is different from True: one is a string and the other is a bool data type in Python. The .any() check above assumes the criteria column holds real bools; if your column was read in as strings, convert it (or compare against 'True') before applying the check.
I have the following df,
id match_type amount negative_amount
1 exact 10 False
1 exact 20 False
1 name 30 False
1 name 40 False
1 amount 15 True
1 amount 15 True
2 exact 0 False
2 exact 0 False
I want to create a column 0_amount_sum that indicates (boolean) if the amount sum is <= 0 or not for each id of a particular match_type, e.g. the following is the result df;
id match_type amount 0_amount_sum negative_amount
1 exact 10 False False
1 exact 20 False False
1 name 30 False False
1 name 40 False False
1 amount 15 True True
1 amount 15 True True
2 exact 0 True False
2 exact 0 True False
for id=1 and match_type=exact, the amount sum is 30, so 0_amount_sum is False. The code is as follows,
df = df.loc[df.match_type=='exact']
df['0_amount_sum_'] = (df.assign(
amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
'id')['amount_n'].transform(lambda x: sum(x) <= 0))
df = df.loc[df.match_type=='name']
df['0_amount_sum_'] = (df.assign(
amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
'id')['amount_n'].transform(lambda x: sum(x) <= 0))
df = df.loc[df.match_type=='amount']
df['0_amount_sum_'] = (df.assign(
amount_n=df.amount * np.where(df.negative_amount, -1, 1)).groupby(
'id')['amount_n'].transform(lambda x: sum(x) <= 0))
I am wondering if there is a better or more efficient way to do that, especially when the values of match_type are unknown, so that the code can automatically enumerate all the possible values and then do the calculation accordingly.
I believe you need to group by two Series (columns) instead of filtering:
df['0_amount_sum_'] = ((df.amount * np.where(df.negative_amount, -1, 1))
                       .groupby([df['id'], df['match_type']])
                       .transform('sum')
                       .le(0))
id match_type amount negative_amount 0_amount_sum_
0 1 exact 10 False False
1 1 exact 20 False False
2 1 name 30 False False
3 1 name 40 False False
4 1 amount 15 True True
5 1 amount 15 True True
6 2 exact 0 False True
7 2 exact 0 False True
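To make the intermediate values visible, the same computation can be split into steps on the sample frame from the question. This is only a sketch; `signed` and `group_sum` are illustrative names that do not appear in the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 1, 1, 1, 1, 2, 2],
    'match_type': ['exact', 'exact', 'name', 'name',
                   'amount', 'amount', 'exact', 'exact'],
    'amount': [10, 20, 30, 40, 15, 15, 0, 0],
    'negative_amount': [False, False, False, False, True, True, False, False],
})

# Flip the sign of amounts flagged as negative
signed = df['amount'] * np.where(df['negative_amount'], -1, 1)

# Sum within each (id, match_type) pair and broadcast back to every row
group_sum = signed.groupby([df['id'], df['match_type']]).transform('sum')

df['0_amount_sum_'] = group_sum.le(0)
print(df)
```

Grouping by both keys at once means new match_type values are picked up automatically, with no per-value filtering.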