How to look into the previous three row values relative to the current row in Python after applying a group by

How can I get the following expected output in Python?
Sample input with expected output:
ACTUAL_EXPECTED_OUTPUT is the expected output column.
The scenario: for each account we need to look at the prior three observations of the IS_DEFAULT column, and if a 1 appears in any of those three observations, the result should be 1, else 0.
Group by ACCT_ID (ordering by MONTH_SINCE_DISB if needed); then, for each account, look at the prior three observations, and if a 1 appears in any of them, mark the new column as 1, else 0. The same logic should be applied iteratively for every ACCT_ID.

Something like this should work:
import numpy as np

# Create a temp column: once the first 1 is found for an ACCT_ID,
# forward-fill the rest of that account's rows to 1
# (note: the `method` argument of replace() is deprecated in pandas 2.x)
df['ISDEFAULT_TEMP'] = df.groupby('ACCT_ID')['IS_DEFAULT'].apply(
    lambda x: x.replace(to_replace=0, method='ffill'))

# Condition on that new column: if its cumsum > 2 within an ACCT_ID, then True
# (i.e. an IS_DEFAULT=1 was first seen at least 2 rows ago)
cond = df.groupby('ACCT_ID')['ISDEFAULT_TEMP'].transform('cumsum') > 2

# Define the new column from the condition
df['ACTUAL_EXPECTED_OUTPUT'] = np.where(cond, 1, 0)
df.drop('ISDEFAULT_TEMP', axis=1, inplace=True)
df
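An alternative minimal sketch that checks only the prior three rows per account directly, assuming the column names from the question (ACCT_ID, IS_DEFAULT, MONTH_SINCE_DISB); unlike the ffill approach above, a default seen more than three rows back does not count here:
import numpy as np

# Sort so "prior" means earlier months within each account
df = df.sort_values(['ACCT_ID', 'MONTH_SINCE_DISB'])

# shift(1) excludes the current row; rolling(3).max() scans the prior three rows
prior_max = (df.groupby('ACCT_ID')['IS_DEFAULT']
               .transform(lambda s: s.shift(1).rolling(3, min_periods=1).max()))
df['ACTUAL_EXPECTED_OUTPUT'] = np.where(prior_max.eq(1), 1, 0)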

Related

Sampling a dataframe according to some rules: balancing a multilabel dataset

I have a dataframe like this:
import pandas as pd

df = pd.DataFrame({'id': [10, 20, 30, 40],
                   'text': ['some text', 'another text', 'random stuff', 'my cat is a god'],
                   'A': [0, 0, 1, 1],
                   'B': [1, 1, 0, 0],
                   'C': [0, 0, 0, 1],
                   'D': [1, 0, 1, 0]})
Here I have columns from A to D, but my real dataframe has 100 columns with values of 0 and 1, and about 100k records.
For example, the column A is related to the 3rd and 4th rows of text because it is labeled as 1. In the same way, A is not related to the 1st and 2nd rows of text because it is labeled as 0.
What I need to do is to sample this dataframe in a way that I have the same, or about the same, number of occurrences of each feature.
In this case, the feature C has only one occurrence, so I need to filter all the other columns in a way that I have one text with A, one text with B, one text with C, etc.
Ideally I could set, for example, n=100, meaning I want to sample in a way that I have 100 records for each feature.
This dataset is a multilabel training dataset and is highly unbalanced; I am looking for the best way to balance it for a machine learning task.
Important: I don't want to exclude the 0 features. I just want to have ABOUT the same number of 1s and 0s per column.
For example, with a final dataset of 1k records, I would like to have all columns from A to the final column, each with about the same number of 1s and 0s. To accomplish this I will need to randomly discard rows (text and id included).
The approach I was trying was to look at the feature with the lowest count of 1s and then use this value as a threshold.
Edit 1: One possible way I thought of is to use:
df.sum(axis=0, skipna=True)
Then I can use the column with the lowest sum as the threshold to filter on the text column. I don't know how to do this filtering step.
Thanks
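A hedged sketch of that filtering step, under the assumption that the goal is to cap every label at the positive count of the rarest label (undersampling); label_cols, threshold, and the drop_duplicates step are illustrative choices, not part of the question:
import pandas as pd

# Label columns are everything except the id/text columns
label_cols = df.columns.difference(['id', 'text'])

# Count of 1s per label; the rarest label sets the threshold
counts = df[label_cols].sum(axis=0, skipna=True)
threshold = int(counts.min())

# For each label, keep up to `threshold` random rows where that label is 1,
# then drop rows selected more than once
sampled = pd.concat(
    [df[df[col] == 1].sample(n=min(threshold, int((df[col] == 1).sum())))
     for col in label_cols]
).drop_duplicates(subset='id')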
The exact output you expect is unclear, but assuming you want one random row per letter that has a 1, you could reshape (while dropping the 0s) and use GroupBy.sample:
(df
.set_index(['id', 'text'])
.replace(0, float('nan'))
.stack()
.groupby(level=-1).sample(n=1)
.reset_index()
)
NB: you can rename the columns if needed.
Output:
   id             text level_2    0
0  30     random stuff       A  1.0
1  20     another text       B  1.0
2  40  my cat is a god       C  1.0
3  30     random stuff       D  1.0
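If more than one row per label is wanted (the question mentions n=100), a hedged variation of the same reshape; the min(n, len(g)) guard is an assumption to keep sample from raising on labels with fewer than n positive rows:
n = 100

(df
 .set_index(['id', 'text'])
 .replace(0, float('nan'))
 .stack()
 .groupby(level=-1, group_keys=False)
 .apply(lambda g: g.sample(n=min(n, len(g))))
 .reset_index()
)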

assign value based on timestamps

I have two dataframes. The first one is like:
import pandas as pd

data = [
    [11, 'a', 1],
    [16, 'b', 2],
    [15, 'a', 3],
    [19, 'b', 4],
]
data = pd.DataFrame(data)
and the second one is like:
find = [
    [4, 'a'],
    [11, 'b'],
    [11, 'a'],
    [16, 'b'],
    [17, 'a'],
]
find = pd.DataFrame(find)
I'd like to assign values to the second dataframe based on the first dataframe. There are a few conditions that need to be checked; for example:
1. if the 1st row is 4 and a, then return 1
2. if the 2nd row is 11 and b, then return 2
3. if the 3rd row is 11 and a, then return 1
4. if the 4th row is 16 and b, then return 4
I tried to write a for loop to do this, but the dataset is pretty big, so it takes too much time to run and in the end it fails.
Is there any good solution for this question? Appreciated!
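The question never states the matching rule explicitly, but most of the listed outputs are consistent with "first row in data with the same letter and a timestamp at or after the lookup value". A hedged sketch under that assumption, with illustrative column names ts/key/value (pd.merge_asof needs named columns sorted on the join key):
import pandas as pd

data = pd.DataFrame([[11, 'a', 1], [16, 'b', 2], [15, 'a', 3], [19, 'b', 4]],
                    columns=['ts', 'key', 'value'])
find = pd.DataFrame([[4, 'a'], [11, 'b'], [11, 'a'], [16, 'b'], [17, 'a']],
                    columns=['ts', 'key'])

# merge_asof matches each find row to the nearest data row with the same key;
# direction='forward' takes the first data timestamp >= the find timestamp
out = pd.merge_asof(find.sort_values('ts'), data.sort_values('ts'),
                    on='ts', by='key', direction='forward')
Note this yields 2 (not the 4 the question lists) for the (16, b) row, so the intended rule may differ slightly; merge_asof also avoids the row-by-row loop that was timing out.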

compare one column value with all the values of other column using pandas

I have an excel file which contains the below values.
I need to compare each a_id value with all the values of b_id, and if it matches, I have to update the value of a_flag to 1, otherwise 0.
For example, take the first value in a_id, i.e. 123, and compare it with all the values of b_id (113, 211, 222, 123). When it reaches 123 in b_id we can see it matches, so we update the value of a_flag to 1.
In the same way, take all the values of a_id and compare them with all the values of b_id. After everything is done, we will have either 1 or 0 in the a_flag column.
Once that's done, we take the first value of b_id, compare it with all the values in the a_id column, and update the b_flag column accordingly.
Finally I will have the below data.
I need to do this using pandas because I am dealing with a large collection of data. Below is my attempt, but it only compares row by row; for example, it compares 123 (the first a_id value) with 113 only (the first b_id value).
import pandas as pd

df1 = pd.read_excel('system_data.xlsx')
df1['a_flag'] = (df1['a_id'] == df1['b_id']).astype(int)
Use Series.isin to test membership:
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)
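A minimal runnable sketch with made-up values (the question's real data lives in an Excel file, so the numbers below are illustrative only):
import pandas as pd

df1 = pd.DataFrame({'a_id': [123, 456, 789, 112],
                    'b_id': [113, 211, 222, 123]})

# isin checks each value against the whole other column, not just the same row
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)
print(df1)  # only the 123/123 pair sets a flag to 1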

How to select a bunch of rows

I have a dataframe with multiple columns. I want to select a bunch of rows where column B has consecutive 1s, check whether column A in those rows has any value equal to 0.04, and if so keep that bunch of rows and extract the start and end values of column A for it.
Here is my dataframe
Here is my desired output:
Filter consecutive groups: build a group key with .diff().abs().cumsum().bfill(), keep only the groups satisfying the conditions (x['B'].eq(1).any() and x['A'].eq(0.04).any()), then group by that consecutivity column and aggregate the first and last values of A.
df['temp'] = df['B'].diff().abs().cumsum().bfill()
(df.groupby('temp')
   .filter(lambda x: x['B'].eq(1).any() and x['A'].eq(0.04).any())
   .groupby('temp')
   .agg({'A': ['first', 'last']}))
Out:
          A
      first  last
temp
3.0   344.0  39.9
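The question's dataframe was only shown as an image, so here is a minimal reproducible sketch with made-up values chosen to land on the same output values:
import pandas as pd

df = pd.DataFrame({
    'A': [12.0, 7.0, 344.0, 0.04, 39.9],
    'B': [0, 0, 1, 1, 1],
})

# Each change in B starts a new "consecutivity" group
df['temp'] = df['B'].diff().abs().cumsum().bfill()

out = (df.groupby('temp')
         .filter(lambda x: x['B'].eq(1).any() and x['A'].eq(0.04).any())
         .groupby('temp')
         .agg({'A': ['first', 'last']}))
print(out)  # first=344.0, last=39.9 for the qualifying group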

pandas merged data length

I have two dataframes, each with one column holding the same values (and of equal length) but in a different order, as in this simplified example:
df1 = pd.DataFrame(['a', 'b', 'c', 'd', 'e'], columns=['names'])
df2 = pd.DataFrame(['b', 'e', 'a', 'c', 'd'], columns=['names'])
I want to know the corresponding index of each row of df1 in df2, so I do:
df = pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
This works as expected for this example: the lengths of the dataframes are equal, len(df1) == len(df2) == len(df).
However in my real data, len(df1) == len(df2) == 1714 but len(df) == 1676.
I am puzzled; how is this possible?
I just did an experiment and added duplicates:
df1 = pd.DataFrame(['e', 'a', 'b', 'c', 'd', 'e'], columns=['names'])
df2 = pd.DataFrame(['b', 'e', 'a', 'e', 'c', 'd'], columns=['names'])
df = pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
This gives len(df) == 8, larger than len(df1) == len(df2) == 6.
But in my real data, df is smaller than the individual dataframes.
Since the pandas merge default is an inner join, when you do not specify how, it will only output rows whose key appears in both dfs.
For example:
df1 = pd.DataFrame(['a'], columns=['names'])
df2 = pd.DataFrame(['b', 'e', 'a', 'c', 'd'], columns=['names'])
pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])

   index_x names  index_y
0        0     a        2
Update
df1 = pd.DataFrame(['a', 'a'], columns=['names'])
df2 = pd.DataFrame(['b', 'e', 'a', 'a', 'c', 'd'], columns=['names'])
df1.merge(df2)

  names
0     a
1     a
2     a
3     a
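To see which rows the inner join is dropping in the real data, a hedged sketch using merge's indicator argument (assuming the same names column as above):
import pandas as pd

# An outer join keeps everything; indicator flags where each row came from
check = pd.merge(df1.reset_index(), df2.reset_index(),
                 on=['names'], how='outer', indicator=True)

# Rows present in only one frame explain why the inner join is shorter
only_one_side = check[check['_merge'] != 'both']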
