Compare one column's values with all the values of another column using pandas (Excel)

I have an Excel file which contains the below values.
I need to compare each a_id value with all the values of b_id; if there is a match, I have to update the value of a_flag to 1, otherwise 0.
For example, take the first value of a_id, i.e. 123, and compare it with all the values of b_id (113, 211, 222, 123). When it reaches 123 in b_id we can see it matches, so we update the value of a_flag to 1.
In the same way, take all the values of a_id and compare them with all the values of b_id. Once everything is done, the a_flag column will hold either 1 or 0 for each row.
After that, we take the first value of b_id, compare it with all the values in the a_id column, and update the b_flag column accordingly.
Finally I will have the below data.
I need to do this using pandas because I am dealing with a large collection of data. Below is my attempt, but it only compares row by row: for example, it compares 123 (the first value of a_id) with 113 only (the first value of b_id).
import pandas as pd

df1 = pd.read_excel('system_data.xlsx')
# Elementwise comparison: each row of a_id against the same row of b_id only.
df1['a_flag'] = (df1['a_id'] == df1['b_id']).astype(int)

Use Series.isin to test membership:
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)
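
For illustration, a minimal self-contained sketch; the b_id values come from the question's example, while the remaining a_id values are made up:

import pandas as pd

# Hypothetical data mirroring the question's example.
df1 = pd.DataFrame({'a_id': [123, 211, 456, 789],
                    'b_id': [113, 211, 222, 123]})

# isin() tests each value against *all* values of the other column,
# not just the value in the same row.
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)  # 1, 1, 0, 0
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)  # 0, 1, 0, 1
print(df1)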

Related

How to look at the previous three rows' values relative to the current row in Python after applying group by

How can I get the following expected output in Python?
Sample input with expected output:
ACTUAL_EXPECTED_OUTPUT is the expected output column.
The scenario is: for each account, we need to look at the prior three observations of the IS_DEFAULT column, and if a 1 appears in any of them, the result should be 1, else 0.
Group by the account id and, if needed, order by MONTH_SINCE_DISB; then for each account id look at the prior three observations, and if a 1 appears in any of them for that account id, the new column should be marked 1, else 0. The same logic should be applied iteratively for all account ids.
Something like this should work:
import numpy as np

# Create a temp column: once the first 1 is found for an ACCT_ID,
# forward-fill the rest of that account's rows to 1.
# (Note: the method= argument of replace() is deprecated in recent pandas.)
df['ISDEFAULT_TEMP'] = df.groupby('ACCT_ID')['IS_DEFAULT'].apply(lambda x: x.replace(to_replace=0, method='ffill'))

# Build the condition from that new column: if the cumsum > 2 for an ACCT_ID,
# then True (i.e. an IS_DEFAULT=1 was seen at least 2 rows ago).
cond = df.groupby('ACCT_ID')['ISDEFAULT_TEMP'].transform('cumsum') > 2

# Define the new column given the condition.
df['ACTUAL_EXPECTED_OUTPUT'] = np.where(cond, 1, 0)
df.drop('ISDEFAULT_TEMP', axis=1, inplace=True)
df
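
If the requirement is read strictly as "a 1 in any of the previous three observations" (rather than anything since the first default), a rolling-window variant along these lines should also work; the column names are taken from the question:

# Order each account's rows, then take the max of the previous three
# IS_DEFAULT values (shift() excludes the current row).
df['ACTUAL_EXPECTED_OUTPUT'] = (
    df.sort_values('MONTH_SINCE_DISB')
      .groupby('ACCT_ID')['IS_DEFAULT']
      .transform(lambda s: s.shift().rolling(3, min_periods=1).max())
      .fillna(0)
      .astype(int)
)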

How to merge two dataframes based on OR condition on multiple columns?

Elaborating on this question, I am trying to match two pandas dataframes.
The matching condition is:
(left_df.column_to_match == right_df.first_column) | (left_df.column_to_match == right_df.second_column)
or, in words, the column to match in the left dataframe should be equal to either the first or the second column in the right dataframe, hence the OR condition.
I can make a workaround using pd.merge and passing lists, such as
left_df.merge(right_df, left_on=['to_match', 'to_match'], right_on=['first_column', 'second_column'])
but this, in turn, gives me only the AND condition result, where both columns match; that is, the two columns in right_df have the same value.
This is an example of the input data:
left_df:
   To Match
0  TCNU4843483
1  MA18219
2  MA81192
3  MFREIGHT

right_df:
   First     Second
0  ASDREF    TCNU4843483
1  MA18219   Null
2  Null      Null
3  HROB789   NESU6748392
and this is the expected output:
   To Match     First     Second
0  TCNU4843483  ASDREF    TCNU4843483
1  MA18219      MA18219   Null
2  MA81192      Null      Null
3  MFREIGHT     Null      Null
4  Null         HROB789   NESU6748392
Any idea whether pandas supports this, or do I have to write my own function?
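
One common way to get an OR-style merge, sketched here under the assumption that each left value matches at most one right row, is to merge on each candidate column separately and combine the results:

import pandas as pd

left_df = pd.DataFrame({'To Match': ['TCNU4843483', 'MA18219', 'MA81192', 'MFREIGHT']})
right_df = pd.DataFrame({'First':  ['ASDREF', 'MA18219', None, 'HROB789'],
                         'Second': ['TCNU4843483', None, None, 'NESU6748392']})

# One merge per candidate key; fill the gaps of the first with the second.
by_first  = left_df.merge(right_df, left_on='To Match', right_on='First',  how='left')
by_second = left_df.merge(right_df, left_on='To Match', right_on='Second', how='left')
merged = by_first.combine_first(by_second)

# Append right rows that no left value matched, mimicking a full outer join.
matched = (right_df['First'].isin(left_df['To Match'])
           | right_df['Second'].isin(left_df['To Match']))
result = pd.concat([merged, right_df[~matched]], ignore_index=True)
print(result)

Note that the all-Null right row also survives as unmatched here, while the expected output above drops it, so a final filter on the right-hand columns may still be needed.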

Code optimisation - comparing two datetime columns by month and creating a new column too slow

I am trying to create a new column in a pandas dataframe. If the other two date columns in my dataframe share the same month, this new column should have 1 as its value, otherwise 0. Also, I need to check that the ids match another list of ids I saved previously elsewhere, and mark only those with 1. I have some code, but it is useless since I am dealing with almost a billion rows.
my_list_of_ids = df[df.bool_column == 1].id.values

def my_func(date1, date2):
    for id_ in df.id:
        if id_ in my_list_of_ids:
            if date1.month == date2.month:
                my_var = 1
            else:
                my_var = 0
        else:
            my_var = 0
    return my_var

df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id  | date1      | date2      | bool_column | new_column
id1 | 2019-02-13 | 2019-04-11 | 1           | 0
id1 | 2019-03-15 | 2019-04-11 | 0           | 0
id1 | 2019-04-23 | 2019-04-11 | 0           | 1
id2 | 2019-08-22 | 2019-08-11 | 1           | 1
id2 | ...
id3 | 2019-09-01 | 2019-09-30 | 1           | 1
...
What I need to do is save the ids that have 1 in my bool_column, then loop through all of the ids in my dataframe and check whether they are in the previously created list (= 1). Then I want to compare the month and the year of the date1 and date2 columns and, where they match, set new_column to 1, otherwise 0.
The pandas way to do this is:
mask = (df['date1'].dt.month == df['date2'].dt.month) & df['id'].isin(my_list_of_ids)
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large dataset, this will take time, but it should be faster than using apply.
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
By the way I believe there's a tiny error in your example data, the new_column value for your third row should be 0, since your bool_column value there is 0.
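
Putting that together on the sample rows from the question (with the third row's new_column corrected to 0, per the note above):

import pandas as pd

# Sample rows reconstructed from the question's example.
df = pd.DataFrame({
    'id':    ['id1', 'id1', 'id1', 'id2', 'id3'],
    'date1': pd.to_datetime(['2019-02-13', '2019-03-15', '2019-04-23',
                             '2019-08-22', '2019-09-01']),
    'date2': pd.to_datetime(['2019-04-11', '2019-04-11', '2019-04-11',
                             '2019-08-11', '2019-09-30']),
    'bool_column': [1, 0, 0, 1, 1],
})

# Vectorized month comparison gated by the precomputed membership flag.
df['new_column'] = ((df.date1.dt.month == df.date2.dt.month)
                    & df.bool_column.astype(bool)).astype(int)
print(df)  # new_column: 0, 0, 0, 1, 1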

How to select a bunch of rows

I have a dataframe with multiple columns. I want to select a bunch of rows where column B has consecutive 1s, and check whether, within those rows, column A has any value equal to 0.04; if so, I need that bunch of rows, and to extract the start and end values of column A for it.
Here is my dataframe
Here is my desired output:
Build a consecutivity column with .diff().abs().cumsum().bfill(), filter out the groups not satisfying the specific conditions (x['B'].eq(1).any() and x['A'].eq(0.04).any()), then group by the consecutivity column and extract the first and last values of A with agg:
df['temp'] = df.B.diff().abs().cumsum().bfill()
df.groupby('temp').filter(lambda x: x['B'].eq(1).any() and x['A'].eq(0.04).any())\
  .groupby('temp').agg({'A': ['first', 'last']})
Out:
          A
      first  last
temp
3.0   344.0  39.9
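
Since the original dataframe is only shown as an image, here is a runnable sketch on hypothetical data chosen so the result matches the output above:

import pandas as pd

# Hypothetical data: two runs of consecutive 1s in B; only the first
# run contains A == 0.04.
df = pd.DataFrame({
    'A': [344.0, 1.5, 0.04, 39.9, 7.0, 2.2, 5.1],
    'B': [1, 1, 1, 1, 0, 1, 1],
})

# Every change in B starts a new group; bfill covers the leading NaN.
df['temp'] = df.B.diff().abs().cumsum().bfill()

result = (df.groupby('temp')
            .filter(lambda x: x['B'].eq(1).any() and x['A'].eq(0.04).any())
            .groupby('temp')
            .agg({'A': ['first', 'last']}))
print(result)  # first 344.0, last 39.9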

pandas: generate a new column based on values from another column, considering duplicates

I am working on a dataframe which has a column in which each value is a list. I want to derive a new column that considers only lists whose size is greater than 1, and assigns a unique integer id to the corresponding row. If the elements of two lists are the same but in a different order, the two lists should be assigned the same id. A sample dataframe:
document_no_list   cluster_id
[1,2,3]            1
[3,2,1]            1
[4,5,6,7]          2
[8]                0
[9,10]             3
[10,9]             3
The cluster_id column considers only the 1st, 2nd, 3rd, 5th and 6th rows, each of which has a list of size greater than 1, and assigns a unique integer id to the corresponding cell; [1,2,3]/[3,2,1] and [9,10]/[10,9] should each be assigned the same cluster_id.
I previously asked a similar question that did not consider duplicate list values, at
pandas how to derived values for a new column base on another column
I am wondering how to do that in pandas.
First, assign a column with the list lengths, and another with the lists sorted and stored as tuples (tuples are hashable, which drop_duplicates and merge require):
df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))
Then assign a cluster_id to each distinct sorted list:
ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1, len(ids) + 1)
Left join this onto the original dataframe, and fill whatever hasn't been joined (the singletons) with zeros:
df = df.merge(ids, how='left').fillna({'cluster_id': 0})
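
End to end on the sample data above (a minimal sketch; the helper columns can be dropped afterwards):

import pandas as pd

df = pd.DataFrame({'document_no_list':
                   [[1, 2, 3], [3, 2, 1], [4, 5, 6, 7], [8], [9, 10], [10, 9]]})

df['list_len'] = df.document_no_list.apply(len)
df['list_sorted'] = df.document_no_list.apply(lambda x: tuple(sorted(x)))

# One id per distinct multi-element list, ignoring element order.
ids = df.loc[df.list_len > 1, ['list_sorted']].drop_duplicates()
ids['cluster_id'] = range(1, len(ids) + 1)

df = df.merge(ids, how='left').fillna({'cluster_id': 0})
df['cluster_id'] = df['cluster_id'].astype(int)
print(df[['document_no_list', 'cluster_id']])  # 1, 1, 2, 0, 3, 3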
