How to pandas-groupby a column and get a specific result per group depending on the values of another column? - python-3.x

I have a DataFrame with two columns: one holds dates and the other holds True or False values.
Use this code to build the DataFrame:
import pandas as pd

d_range = pd.date_range(start='01-01-2018', end='01-06-2018', freq='0.2D')
d_range = d_range.date
my_list = []
for i in range(d_range.size):
    if 0 < i < 18:
        my_list.append(False)
    else:
        my_list.append(True)
df = pd.DataFrame({'date': d_range, 'met criteria': my_list})
df.set_index(['date'])  # note: has no effect on df unless assigned back
This will give us this DataFrame:
print(df)
         date  met criteria
0 2018-01-01 True
1 2018-01-01 False
2 2018-01-01 False
3 2018-01-01 False
4 2018-01-01 False
5 2018-01-02 False
6 2018-01-02 False
7 2018-01-02 False
8 2018-01-02 False
9 2018-01-02 False
10 2018-01-03 False
11 2018-01-03 False
12 2018-01-03 False
13 2018-01-03 False
14 2018-01-03 False
15 2018-01-04 False
16 2018-01-04 False
17 2018-01-04 False
18 2018-01-04 True
19 2018-01-04 True
20 2018-01-05 True
21 2018-01-05 True
22 2018-01-05 True
23 2018-01-05 True
24 2018-01-05 True
25 2018-01-06 True
I need an outcome that groups by 'date': if a date has at least one True value, the result for that date is True; otherwise it is False.
The outcome should look like:
date criteria
2018-01-01 True
2018-01-02 False
2018-01-03 False
2018-01-04 True
2018-01-05 True
2018-01-06 True
Can you suggest some code that will do that, please?

Here is a way to do it:
In [1]:
import pandas as pd

d_range = pd.date_range(start='01-01-2018', end='01-06-2018', freq='0.2D')
d_range = d_range.date
my_list = []
for i in range(d_range.size):
    if 0 < i < 18:
        my_list.append(False)
    else:
        my_list.append(True)
df = pd.DataFrame({'date': d_range, 'met criteria': my_list})

def True_or_Not(x):
    return x > 0

# summing booleans counts the Trues per date; a count above zero
# means the group contained at least one True
df.groupby('date').sum().apply(True_or_Not)
Out [1]:
met criteria
date
2018-01-01 True
2018-01-02 False
2018-01-03 False
2018-01-04 True
2018-01-05 True
2018-01-06 True
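A shorter route to the same result is GroupBy.any(), which is True for a group as soon as the group contains a single True (a minimal sketch, reusing the df built above):
df.groupby('date')['met criteria'].any()
# date
# 2018-01-01     True
# 2018-01-02    False
# ...
# 2018-01-06     True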

You can use the isin method for this. Basically, filter the dataframe on each unique value in your date column, then check whether True exists in the criteria column of each resulting dataframe.
Populate a dictionary from the results, then create a new dataframe whose first column holds your dates and whose second column holds the bool values mapped from your_dict:
date_unique = sorted(set(df['date'].tolist()))  # sorted for a deterministic row order
your_dict = {}
for d in date_unique:
    test_df = df[df['date'].isin([d])]
    # test against .values: `x in series` checks the index, not the values
    if True in test_df['met criteria'].values:
        your_dict[d] = True
    else:
        your_dict[d] = False
output_df = pd.DataFrame()
output_df['date'] = date_unique
output_df['criteria'] = output_df['date'].map(your_dict)
Please note that 'True' is different from True: the former is a string and the latter a bool in Python. However your criteria column was read into the original dataframe, apply the matching type in the condition inside the loop.
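If the column did arrive as strings (e.g. read from a CSV), a minimal coercion sketch:
df['met criteria'] = df['met criteria'].map({'True': True, 'False': False})
# note: .astype(bool) would NOT work here, since any non-empty string,
# including 'False', is truthy in Python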

Related

I need to match each unique row of one dataset against another dataset on their shared columns and produce a dataframe with the match counts.

Below is the example dataframe, where id is the index.
df:
id  A      B      C
1   False  False  NA
2   True   False  NA
3   False  True   True
df2:
A      B      C      D
True   False  NA     True
False  True   False  False
False  True   True   True
False  True   True   True
False  True   True   True
False  True   True   True
False  True   True   True
False  True   True   True
Output:
For each id of df, count the df2 rows whose values match that row of df on the shared columns (ignoring column D of df2), and return a dataframe with the same index:
id  A      B      C     Sum of matched true values in columns of df2
1   False  False  NA    0
2   True   False  NA    2
3   False  True   True  6
match_df = try_df.merge(df, on=list_new, how='outer', suffixes=('', '_y'))
match_df.drop(match_df.filter(regex='_y$').columns, axis=1, inplace=True)
df_grouped = match_df.groupby('CIS Sub Controls')[list_new].agg(['sum', 'count'])
df_final = pd.concat([df_grouped['col1']['sum'], df_grouped['col2']['sum'],
                      df_grouped['col3']['sum'], df_grouped['col4']['sum'],
                      df_grouped['col1']['count'], df_grouped['col2']['count'],
                      df_grouped['col3']['count'], df_grouped['col4']['count']],
                     axis=1).join(df_grouped.index)
This is what I tried, but it does not do what I need.
You can use value_counts and merge:
cols = df1.columns.intersection(df2.columns)
out = (df1.merge(df2[cols].value_counts(dropna=False).reset_index(name='sum'),
                 how='left')
          .fillna({'sum': 0}, downcast='infer'))
Output:
id A B C sum
0 1 False False NaN 0
1 2 True False NaN 1
2 3 False True True 6
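To see why this works, a minimal sketch of the intermediate step, reusing cols and df2 from above: DataFrame.value_counts counts duplicate rows, so merging those counts back onto df1 on the shared columns attaches the number of matching df2 rows to each row of df1.
counts = df2[cols].value_counts(dropna=False).reset_index(name='sum')
# one row per distinct (A, B, C) combination in df2:
# (False, True, True) occurs 6 times, (True, False, NaN) and
# (False, True, False) once each; df1 rows with no match get NaN
# from the left merge, which fillna then turns into 0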

Counting True or False

I have the following dataframe:
True_False
2018-01-02 True
2018-01-03 True
2018-01-04 False
2018-01-05 False
2018-01-08 False
... ...
2020-01-20 True
2020-01-21 True
2020-01-22 True
2020-01-23 True
2020-01-24 False
504 rows × 1 columns
I want to count how many successive True or False values there are, not the totals: the count must restart each time the value toggles between True and False. From those run lengths I eventually want to calculate mean(), max() and min() in days. Is it possible to produce this data with pandas?
Solution if all datetimes are consecutive:
You can create a helper Series of consecutive-group ids with Series.shift and Series.cumsum, then get the run lengths with GroupBy.size:
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s)
True_False True_False
False 2 3
4 1
True 1 2
3 4
dtype: int64
And last aggregate min, max and mean per first level of MultiIndex:
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 2 3 1
True 3 4 2
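To make the helper g concrete, a minimal sketch on toy data: g increments every time the value changes, so each run of identical values receives its own group id.
import pandas as pd

s = pd.Series([True, True, False, False, True])
g = s.ne(s.shift()).cumsum()
print(g.tolist())                # [1, 1, 2, 2, 3]
print(s.groupby([s, g]).size())  # run lengths: False -> 2; True -> 2, 1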
If the datetimes are not consecutive, the first step is DataFrame.asfreq:
df = df.asfreq('d')
g = df['True_False'].ne(df['True_False'].shift()).cumsum()
s = df.groupby(['True_False',g]).size()
print (s.groupby(level=0).agg(['mean','max','min']))
mean max min
True_False
False 1.333333 2 1
True 3.000000 4 2

Pandas first date condition is met while another condition is active

I have a dataframe with a time series of scores. My goal is to detect when the score rises above a certain threshold th and then to find when it drops back to 0. It is quite easy to find each condition separately:
dates_1 = score > th
dates_2 = np.sign(score[1:]) == np.sign(score.shift(1).dropna())
However, I don't know the most pythonic way to restrict dates_2 to only those dates where an 'active' dates_1 event has already been observed.
Perhaps I could use an auxiliary column 'active', set to True whenever score > th and back to False when the dates_2 condition is met; that way I could ask for the change in sign AND active == True. However, that approach requires iteration, and I'm wondering if there's a vectorized solution to my problem.
Any thoughts on how to improve my approach?
Sample data:
date score
2010-01-04 0.0
2010-01-05 -0.3667779798467592
2010-01-06 -1.9641427199568868
2010-01-07 -0.49976215445519134
2010-01-08 -0.7069108074548405
2010-01-11 -1.4624766212523337
2010-01-12 -0.9132777669357441
2010-01-13 0.16204588193577152
2010-01-14 0.958085568609925
2010-01-15 1.4683022129399834
2010-01-19 3.036016680985081
2010-01-20 2.2357911432637345
2010-01-21 2.8827438241030707
2010-01-22 -3.395977874791837
Expected Output
with th = 0.94:
date active
2010-01-04 False
2010-01-05 False
2010-01-06 False
2010-01-07 False
2010-01-08 False
2010-01-11 False
2010-01-12 False
2010-01-13 False
2010-01-14 True
2010-01-15 True
2010-01-19 True
2010-01-20 True
2010-01-21 True
2010-01-22 False
Not Vectorized!
def alt_cond(s, th):
    active = False
    for x in s:
        # while inactive, switch on when x >= th;
        # while active, stay on only while x > 0
        active = [x >= th, x > 0][int(active)]
        yield active

df.assign(A=[*alt_cond(df.score, 0.94)])
date score A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
Vectorized (Sort Of)
I used Numba to really speed things up. It is still a loop, but it should be very fast if you can install numba:
import numpy as np
from numba import njit

@njit
def alt_cond(s, th):
    active = False
    out = np.zeros(len(s), dtype=np.bool_)
    for i, x in enumerate(s):
        if active:
            if x <= 0:
                active = False
        else:
            if x >= th:
                active = True
        out[i] = active
    return out

df.assign(A=alt_cond(df.score.values, .94))
Response to Comment
You can have a dictionary of column names and threshold values and iterate:
th = {'score': 0.94}

df.join(pd.DataFrame(
    np.column_stack([[*alt_cond(df[k], v)] for k, v in th.items()]),
    df.index, [f"{k}_A" for k in th]
))
date score score_A
0 2010-01-04 0.000000 False
1 2010-01-05 -0.366778 False
2 2010-01-06 -1.964143 False
3 2010-01-07 -0.499762 False
4 2010-01-08 -0.706911 False
5 2010-01-11 -1.462477 False
6 2010-01-12 -0.913278 False
7 2010-01-13 0.162046 False
8 2010-01-14 0.958086 True
9 2010-01-15 1.468302 True
10 2010-01-19 3.036017 True
11 2010-01-20 2.235791 True
12 2010-01-21 2.882744 True
13 2010-01-22 -3.395978 False
I'm assuming your data is in a pandas dataframe, and 'date' is your index column. Then this would be the way I'd do it:
th = 0.94 # Threshold value
i = df[df.score>th].index[0] # Check the index for the first condition
df[i:][df.score<0].index[0] # Check the index for the second condition, after the index of the first condition
So: use conditional indexing to find the index for the first condition ([df.score > th]), then check for the second condition ([df.score < 0]), but begin looking from the index found for the first condition ([i:]).
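For completeness, a fully vectorized version of the same latch is possible (a sketch, assuming th > 0 as in the example): mark True wherever score >= th, False wherever score <= 0, leave the dead zone in between as NaN, and forward-fill so the previous state persists.
import numpy as np
import pandas as pd

def latch(score, th):
    state = pd.Series(np.where(score >= th, True,
                      np.where(score <= 0, False, np.nan)),
                      index=score.index)
    # ffill carries the last on/off state through the dead zone;
    # rows before the first event default to False
    return state.ffill().fillna(False).astype(bool)

df.assign(A=latch(df.score, 0.94))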

Combine rows based on index or column

I have three dataframes: df1, df2, df3. I am trying to add a list of ART_UNIT values to df1.
df1 is 260846 rows x 4 columns:
Index SYMBOL level not-allocatable additional-only
0 A 2 True False
1 A01 4 True False
2 A01B 5 True False
3 A01B1/00 7 False False
4 A01B1/02 8 False False
5 A01B1/022 9 False False
6 A01B1/024 9 False False
7 A01B1/026 9 False False
df2 is 941516 rows x 2 columns:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
0 A44C27/00 3715
1 A44C27/001 2015
2 A44C27/001 3715
3 A44C27/001 2615
4 A44C27/005 2815
5 A44C27/006 3725
6 A44C27/007 3215
7 A44C27/008 3715
8 F41A33/00 3715
9 F41A33/02 3715
10 F41A33/04 3715
11 F41A33/06 3715
12 G07C13/00 3715
13 G07C13/005 3715
14 G07C13/02 3716
And df3 is the same format as df2, but has 673023 rows x 2 columns
The 'CLASSIFICATION_SYMBOL_CD' in df2 and df3 are not unique.
For each 'CLASSIFICATION_SYMBOL_CD' in df2 and df3, I want to find the same string in df1 'SYMBOL' and add a new column to df1 'ART_UNIT' that contains all of the 'ART_UNIT' from df2 and df3.
For example, in df2, 'CLASSIFICATION_SYMBOL_CD' A44C27/001 has ART_UNIT 2015, 3715, and 2615.
I want to write those ART_UNIT to the correct row in df1 so that is reads:
Index SYMBOL level not-allocatable additional-only ART_UNIT
211 A44C27/001 2 True False [2015, 3715, 2615]
So far, I've tried to group df2/df3 by 'CLASSIFICATION_SYMBOL_CD'
gp = df2.groupby(['CLASSIFICATION_SYMBOL_CD'])
for x in df2['CLASSIFICATION_SYMBOL_CD'].unique():
    df2_g = gp.get_group(x)
Which gives me:
Index CLASSIFICATION_SYMBOL_CD ART_UNIT
1354 A61N1/3714 3762
117752 A61N1/3714 3766
347573 A61N1/3714 3736
548026 A61N1/3714 3762
560771 A61N1/3714 3762
566120 A61N1/3714 3766
566178 A61N1/3714 3762
799486 A61N1/3714 3736
802408 A61N1/3714 3736
Since df2 and df3 have the same format, concatenate them first.
import pandas as pd
df = pd.concat([df2, df3])
Then to get the lists of all art units, groupby and apply list.
df = df.groupby('CLASSIFICATION_SYMBOL_CD').ART_UNIT.apply(list).reset_index()
# CLASSIFICATION_SYMBOL_CD ART_UNIT
#0 A44C27/00 [3715]
#1 A44C27/001 [2015, 3715, 2615]
#2 A44C27/005 [2815]
#3 A44C27/006 [3725]
#...
Finally, bring this information to df1 with a merge (you could map or something else too). Rename the column first to have less to clean up after the merge.
df = df.rename(columns={'CLASSIFICATION_SYMBOL_CD': 'SYMBOL'})
df1 = df1.merge(df, on='SYMBOL', how='left')
Output:
Index SYMBOL level not-allocatable additional-only ART_UNIT
0 0 A 2 True False NaN
1 1 A01 4 True False NaN
2 2 A01B 5 True False NaN
3 3 A01B1/00 7 False False NaN
4 4 A01B1/02 8 False False NaN
5 5 A01B1/022 9 False False NaN
6 6 A01B1/024 9 False False NaN
7 7 A01B1/026 9 False False NaN
Sadly, you didn't provide any overlapping SYMBOLs in df1, so nothing merged. But this will work with your full data.
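A sketch of the map alternative mentioned above: build a Series of lists indexed by symbol, then map it onto df1 without a merge.
art_units = (pd.concat([df2, df3])
               .groupby('CLASSIFICATION_SYMBOL_CD')['ART_UNIT']
               .apply(list))
df1['ART_UNIT'] = df1['SYMBOL'].map(art_units)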

How to use the condition with other rows (previous moments in time series data), in pandas, python3

I have a pandas DataFrame df. It is time series data, with 1000 rows and 3 columns. What I want is given in the pseudo-code below.
for each row:
    if the value in column 'colA' at [this_row - 1] is higher than
       the value in column 'colB' at [this_row - 2] by more than 3%,
    then set the value in 'colCheck' at [this_row] to True.
Finally, pick out all the rows in the df where 'colCheck' is True.
I will use the following example to further demonstrate my purpose.
df =
'colA', 'colB', 'colCheck'
Dates
2017-01-01, 20, 30, NAN
2017-01-02, 10, 40, NAN
2017-01-03, 50, 20, False
2017-01-04, 40, 10, True
First, when this_row = 2 (the 3rd row, where the date is 2017-01-03), the value in colA at [this_row-1] is 10, the value in colB at [this_row-2] is 30. So (10-30)/30 = -67% < 3%, so the value in colCheck at [this_row] is False.
Likewise, when this_row = 3, (50-40)/40 = 25% > 3%, so the value in colCheck at [this_row] is True.
Last but not least, the first two rows in colCheck should be NaN, since the calculation needs to access [this_row-2] in colB, and the first two rows have no [this_row-2].
Besides, the criteria of 3% and [row-1] in colA, [row-2] in colB are just examples. In my real project, they are situational, e.g. 4% and [row-3].
I am looking for a concise and elegant approach. I am using Python 3.
Thanks.
You can rearrange the maths and use pd.Series.shift
df.colA.shift(1).div(df.colB.shift(2)).gt(1.03)
Dates
2017-01-01 False
2017-01-02 False
2017-01-03 False
2017-01-04 True
dtype: bool
Using pd.DataFrame.assign we can create a copy with the new column
df.assign(colCheck=df.colA.shift(1).div(df.colB.shift(2)).gt(1.03))
colA colB colCheck
Dates
2017-01-01 20 30 False
2017-01-02 10 40 False
2017-01-03 50 20 False
2017-01-04 40 10 True
If you insisted on leaving the first two as NaN, you could use iloc
df.assign(colCheck=df.colA.shift(1).div(df.colB.shift(2)).gt(1.03).iloc[2:])
colA colB colCheck
Dates
2017-01-01 20 30 NaN
2017-01-02 10 40 NaN
2017-01-03 50 20 False
2017-01-04 40 10 True
And for maximum clarity:
# This creates a boolean array of when your conditions are met
colCheck = (df.colA.shift(1) / df.colB.shift(2)) > 1.03
# This chops off the first two `False` values and creates a new
# column named `colCheck` and assigns to it the boolean values
# calculate just above.
df.assign(colCheck=colCheck.iloc[2:])
colA colB colCheck
Dates
2017-01-01 20 30 NaN
2017-01-02 10 40 NaN
2017-01-03 50 20 False
2017-01-04 40 10 True
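Since the question notes that the threshold and offsets are situational (e.g. 4% and [row-3]), here is a small parameterized helper (a sketch; the names pct, lag_a and lag_b are mine, and the a/b > 1 + pct rearrangement assumes colB is positive, as in the sample):
def col_check(df, pct=0.03, lag_a=1, lag_b=2):
    # shift each column by its own offset, then compare the ratio
    check = df.colA.shift(lag_a).div(df.colB.shift(lag_b)).gt(1 + pct)
    return check.iloc[max(lag_a, lag_b):]  # assign leaves the first rows NaN

df.assign(colCheck=col_check(df, pct=0.04, lag_a=3, lag_b=2))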
