How to merge two dataframes based on OR condition on multiple columns? - python-3.x

Elaborating from this question, I am trying to match two Pandas dataframes.
The matching condition is:
(left_df.column_to_match == right_df.first_column) | (left_df.column_to_match == right_df.second_column )
or, in words, the column to match in the left dataframe should equal either the first or the second column in the right dataframe - hence the OR condition.
I can make a workaround using pd.merge and inputting lists such as
left_df.merge(right_df, left_on=['to_match', 'to_match'], right_on=['first_column', 'second_column'])
but this, in turn, gives me only the AND-condition result, i.e. rows where both columns match - that is, where the two columns in right_df have the same value.
This is an example of input data
// left df
      To Match
0  TCNU4843483
1      MA18219
2      MA81192
3     MFREIGHT

// right df
     First       Second
0   ASDREF  TCNU4843483
1  MA18219         Null
2     Null         Null
3  HROB789  NESU6748392
and this of the expected output
      To Match    First       Second
0  TCNU4843483   ASDREF  TCNU4843483
1      MA18219  MA18219         Null
2      MA81192     Null         Null
3     MFREIGHT     Null         Null
4         Null  HROB789  NESU6748392
Any idea about whether pandas supports this, or do I have to write my own function?
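pandas merge itself only ANDs conditions across multiple keys, so one common workaround is to merge once per candidate column and then combine the results. Below is a minimal sketch along those lines for the example frames above; the use of combine_first to prefer the first match, and the final concat/dropna step for unmatched right rows, are my assumptions about the desired behaviour, not a built-in recipe:

import pandas as pd

left_df = pd.DataFrame({'To Match': ['TCNU4843483', 'MA18219', 'MA81192', 'MFREIGHT']})
right_df = pd.DataFrame({'First':  ['ASDREF', 'MA18219', None, 'HROB789'],
                         'Second': ['TCNU4843483', None, None, 'NESU6748392']})

# One left merge per right-hand column
m1 = left_df.merge(right_df, left_on='To Match', right_on='First',  how='left')
m2 = left_df.merge(right_df, left_on='To Match', right_on='Second', how='left')

# Take m1's match where it found one, fall back to m2's otherwise
merged = m1.combine_first(m2)

# Append right rows that matched on neither column (dropping all-null rows)
unmatched = right_df[~(right_df['First'].isin(left_df['To Match']) |
                       right_df['Second'].isin(left_df['To Match']))].dropna(how='all')
result = pd.concat([merged, unmatched], ignore_index=True)
print(result)

On the example data this reproduces the expected output above, with NaN in place of Null.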

Related

How to call a created function with pandas apply to all rows (axis=1) but only to some specific rows of a dataframe?

I have a function which sends automated messages to clients, and takes as input all the columns from a dataframe like the one below.
     name    phone   status       date
0  name_1  phone_1  sending      today
1  name_2  phone_2  sending  yesterday
I iterate through the dataframe with a pandas apply (axis=1) and use the values in the columns of each row as inputs to my function. At the end, after sending, it changes the status to "sent". The thing is, I only want to send to the clients whose date reference is "today". Now, with pandas.apply(axis=1) this is perfectly doable, but in order to slice out the clients with the "today" value, I need to:
create a new dataframe with today's value,
remove it from the original, and then
reappend it to the original.
I thought about running through the whole dataframe and ignoring the rows whose dates differ from "today", but if my dataframe keeps growing, I'm afraid the whole process will become slower. I saw examples of this being done with mask, although people usually use only one column, and I need more than just the one. Is there any way to do this with pandas apply?
Thank you.
I think you can use .loc to filter the data and apply the function to it.
In [13]: df = pd.DataFrame(np.random.rand(5,5))

In [14]: df
Out[14]:
          0         1         2         3         4
0  0.085870  0.013683  0.221890  0.533393  0.622122
1  0.191646  0.331533  0.259235  0.847078  0.649680
2  0.334781  0.521263  0.402030  0.973504  0.903314
3  0.189793  0.251130  0.983956  0.536816  0.703726
4  0.902107  0.226398  0.596697  0.489761  0.535270
If we want to double the values of the rows where the value in the first column is > 0.3, first select those rows:

In [16]: df.loc[df[0] > 0.3]
Out[16]:
          0         1         2         3         4
2  0.334781  0.521263  0.402030  0.973504  0.903314
4  0.902107  0.226398  0.596697  0.489761  0.535270
In [18]: df.loc[df[0] > 0.3] = df.loc[df[0] > 0.3].apply(lambda x: x*2, axis=1)

In [19]: df
Out[19]:
          0         1         2         3         4
0  0.085870  0.013683  0.221890  0.533393  0.622122
1  0.191646  0.331533  0.259235  0.847078  0.649680
2  0.669563  1.042527  0.804061  1.947008  1.806628
3  0.189793  0.251130  0.983956  0.536816  0.703726
4  1.804213  0.452797  1.193394  0.979522  1.070540
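Applied to the question's frame, the same pattern filters with .loc and writes the result back in place, with no splitting and re-appending. A short sketch; send_message here is a hypothetical stand-in for the real sending function:

import pandas as pd

def send_message(row):
    # hypothetical stand-in for the real sending logic
    print(f"sending to {row['name']} at {row['phone']}")
    return 'sent'

df = pd.DataFrame({'name':   ['name_1', 'name_2'],
                   'phone':  ['phone_1', 'phone_2'],
                   'status': ['sending', 'sending'],
                   'date':   ['today', 'yesterday']})

today = df['date'] == 'today'
# Apply only to the matching rows; the returned statuses align on index
df.loc[today, 'status'] = df.loc[today].apply(send_message, axis=1)
print(df)
#      name    phone   status       date
# 0  name_1  phone_1     sent      today
# 1  name_2  phone_2  sending  yesterday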

groupby with a boolean array, like a.groupby([True, True])

I once saw a code segment using groupby as follows
a.groupby([True]*len(a))
Here a is a dataframe. I do not understand what this tries to do. If a has two rows, it is the same as a.groupby([True, True]).
When the groupby parameter is a list, it must have a length equal to the number of rows of the dataframe (a column name satisfies this by definition). It can also be a list of lists, where each sublist must again have a length equal to the number of rows of the dataframe.
Taking a toy dataset -
a = pd.DataFrame([[1,2,2],[3,4,5]], columns=['A','B','C'])
print(a)
   A  B  C
0  1  2  2
1  3  4  5
Using the groupby function you get a grouper object. The * operation on a list replicates it by a scalar, so [True]*len(a) is the same as [True, True]:
grp = a.groupby([True]*len(a))
grp
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x109ceb780>
If you list out the groups you will only get a single group -
list(grp)
[(True,    A  B  C
0  1  2  2
1  3  4  5)]
Maybe the author of that code segment was trying to just create a single tuple?
This is not really a grouping, since the group key has only one unique value, True. Any function applied after the groupby in
a.groupby([True]*len(a))
can be done without groupby.
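A quick check of that claim on the toy frame above: the constant key yields a single group, so the grouped aggregate matches the plain frame-wide one.

import pandas as pd

a = pd.DataFrame([[1,2,2],[3,4,5]], columns=['A','B','C'])

# One group keyed True, so the sums equal the frame-wide sums
print(a.groupby([True]*len(a)).sum())
#       A  B  C
# True  4  6  7
print(a.sum())
# A    4
# B    6
# C    7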

compare one column's values with all the values of another column using pandas

I have an Excel file which contains the values below.
I need to compare each a_id value with all the values of b_id, and if it matches, I have to update the value of a_flag to 1, otherwise 0.
For example, take the first value of a_id, i.e. 123, and compare it with all the values of b_id (113, 211, 222, 123). When it reaches 123 in b_id we can see it matches, so we update the value of a_flag to 1.
In the same way, take all the values of a_id and compare them with all the values of b_id, so that after everything is done, every row has either 1 or 0 in the a_flag column.
Once that is done, we take the first value of b_id, compare it with all the values in the a_id column, and update the b_flag column accordingly.
Finally I will have the data below.
I need to do this using pandas because I am dealing with a large collection of data. Below are my findings, but they compare only with the first value of b_id. For example, it compares 123 (the first a_id value) with 113 only (the first b_id value).
import pandas as pd
df1 = pd.read_excel('system_data.xlsx')
df1['a_flag'] = (df1['a_id'] == df1['b_id']).astype(int)
Use Series.isin to test membership:
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)
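A small runnable demo of the isin approach, with made-up ids standing in for the spreadsheet from the question:

import pandas as pd

df1 = pd.DataFrame({'a_id': [123, 211, 555, 222],
                    'b_id': [113, 211, 222, 123]})

# 1 where the id appears anywhere in the other column, else 0
df1['a_flag'] = df1['a_id'].isin(df1['b_id']).astype(int)
df1['b_flag'] = df1['b_id'].isin(df1['a_id']).astype(int)
print(df1)
#    a_id  b_id  a_flag  b_flag
# 0   123   113       1       0
# 1   211   211       1       1
# 2   555   222       0       1
# 3   222   123       1       1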

pandas merged data length

I have two data frames, each with one column containing the same values (and of equal length) but in a different order, as in this simplified example:
df1=pd.DataFrame(['a','b','c','d','e'],columns=['names'])
df2=pd.DataFrame(['b','e','a','c','d'],columns=['names'])
I want to know the corresponding index of each row of df1 in df2, so I do:
df= pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
This works as expected for this example: the lengths of the data frames are equal, len(df1) = len(df2) = len(df).
However, in my real data, len(df1) = len(df2) = 1714 while len(df) = 1676.
I am puzzled; how is this possible?
I just did an experiment and added duplicates.
df1=pd.DataFrame(['e','a','b','c','d','e'],columns=['names'])
df2=pd.DataFrame(['b','e','a','e','c','d'],columns=['names'])
df= pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
This gives len(df) = 8, larger than len(df1) = len(df2) = 6.
But in my real data df is smaller than the individual df lengths.
Since the default for pandas merge is an inner join, when you do not specify the how method, it will only output rows whose key appears in both dfs.
For example:
df1=pd.DataFrame(['a'],columns=['names'])
df2=pd.DataFrame(['b','e','a','c','d'],columns=['names'])
pd.merge(df1.reset_index(), df2.reset_index(), on=['names'])
   index_x names  index_y
0        0     a        2
Update
df1=pd.DataFrame(['a','a'],columns=['names'])
df2=pd.DataFrame(['b','e','a','a','c','d'],columns=['names'])
df1.merge(df2)
  names
0     a
1     a
2     a
3     a
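To see exactly which keys explain the shrinkage, an outer merge with indicator=True labels each row with the side it came from. A sketch with made-up frames:

import pandas as pd

df1 = pd.DataFrame(['a', 'b', 'x'], columns=['names'])
df2 = pd.DataFrame(['b', 'a', 'y'], columns=['names'])

# _merge tells you whether a key exists in both frames or only one
check = df1.merge(df2, on='names', how='outer', indicator=True)
print(check)
#   names      _merge
# 0     a        both
# 1     b        both
# 2     x   left_only
# 3     y  right_only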

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter a DataFrame's columns based on a value.
In [41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})

In [42]: df
Out[42]:
   A   B
0  a   6
1  2   7
2  3   8
3  4   9
4  5  10
Filtering columns:
In [43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
   A
0  a
1  2
2  3
3  4
4  5
It works! But when I use a string,
In [44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object, and it will work:
df.B = df.B.astype(object)
Always check the data types of the columns before performing operations, using
df.info()
You could do this with masks instead, for example:
df[df.A != 'a'].A
and to filter on any column:
df[df.apply(lambda x: sum([x_ == 'a' for x_ in x]) == 0, axis=1)]
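For reference, a sketch of what those two mask expressions return on the frame from the question (values taken from the construction above):

df[df.A != 'a'].A
# 1    2
# 2    3
# 3    4
# 4    5
# Name: A, dtype: object

df[df.apply(lambda x: sum([x_ == 'a' for x_ in x]) == 0, axis=1)]
#    A   B
# 1  2   7
# 2  3   8
# 3  4   9
# 4  5  10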
The problem is due to the fact that there are both numeric and string objects in the dataframe.
You can loop through the columns and check each column as a Series for a specific value using
(series == 'a').any()
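A minimal sketch of that loop idea, checking every column for the value and keeping only the columns that do not contain it (the frame is the one from the question):

import pandas as pd

df = pd.DataFrame({'A': ['a', 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Keep only the columns in which no cell equals 'a'
cols = [c for c in df.columns if not (df[c] == 'a').any()]
print(df[cols])
#     B
# 0   6
# 1   7
# 2   8
# 3   9
# 4  10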
