For each row of one dataframe, count matches against the overlapping columns of another dataframe - python-3.x

For example, I have two dataframes:
df = [{'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'}, {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}]
df2 = [{'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}]
The idea is to count, for each row of df, whether its values match df2 across the columns they share.
df:
   id      A      B   C
0   1  False  False  NA
1   2   True  False  NA
df2:
      A      B   C     D
0  True  False  NA  True
Output:
   id      A      B   C  Count
0   1  False  False  NA      0
1   2   True  False  NA      1
I tried something like:
for i in range(len(action_value_counts_df.columns)):
    x = action_value_counts_df.columns[i]
    if compare_column.equals(action_value_counts_df[x]):
        print(x, 'Matched')
    else:
        print(x, 'Not Matched')
This code did not help.

1. Merge df and df2 on the overlapping columns.
2. Group by the overlapping columns and count the matches.
3. Merge the count result DataFrame back into df.
4. Replace NaN with 0 in the 'Count' column.
import pandas as pd

df = pd.DataFrame([
    {'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'},
    {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}])
df2 = pd.DataFrame([
    {'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}])

# 1
match_df = pd.merge(left=df, right=df2, on=['A', 'B', 'C'], how='inner')
match_df = match_df.assign(Count=1)
"""
   id     A      B   C     D  Count
0   2  True  False  NA  True      1
"""

# 2
match_count_df = match_df.groupby(['A', 'B', 'C'], as_index=False).count()
match_count_df = match_count_df[['A', 'B', 'C', 'Count']]
"""
      A      B   C  Count
0  True  False  NA      1
"""

# 3
result_df = pd.merge(left=df, right=match_count_df, how='left')
"""
   id      A      B   C  Count
0   1  False  False  NA    NaN
1   2   True  False  NA    1.0
"""

# 4
result_df.loc[:, 'Count'] = result_df['Count'].fillna(0)
"""
   id      A      B   C  Count
0   1  False  False  NA    0.0
1   2   True  False  NA    1.0
"""

You can compare the two dataframes row-wise, with the columns of df2 arranged to match df and NaN values ignored (as they are not comparable):
df.assign(Count=df.set_index('id').apply(lambda x: (x.dropna() == df2[x.index].squeeze().dropna()).all() * 1, axis=1).values)

   id      A      B    C  Count
0   1  False  False  NaN      0
1   2   True  False  NaN      1
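A shorter alternative, assuming the shared columns are exactly the overlap of the two frames, is a left merge with `indicator=True` and flagging the rows that found a partner (a sketch, not one of the answers above):

```python
import pandas as pd

df = pd.DataFrame([{'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'},
                   {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}])
df2 = pd.DataFrame([{'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}])

# columns present in both frames, in df's order
common = [c for c in df.columns if c in df2.columns]
# left merge: rows of df that match some df2 row are tagged 'both';
# drop_duplicates guards against row multiplication if df2 has repeats
merged = df.merge(df2[common].drop_duplicates(), on=common, how='left', indicator=True)
df['Count'] = (merged['_merge'] == 'both').astype(int)
print(df)  # id 1 -> Count 0, id 2 -> Count 1
```

The `_merge` indicator column is what distinguishes matched rows from unmatched ones, so no explicit groupby/count round-trip is needed.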

Related

Python: Compare 2 pandas dataframe with unequal number of rows

I need to compare two pandas dataframes with unequal numbers of rows and generate a new df with True for matching records and False for non-matching or missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'], 'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'], 'x': [4, 11], 'y': [5, 5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects
You can use:
# cell to cell equality
# comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# if you also want to turn a row to False if there is any False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
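Putting the pieces together, a runnable sketch of the date-aligned comparison, using the answer's transpose idiom to mask whole rows:

```python
import pandas as pd

df1 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02', '2022-11-03'],
                    'x': [4, 12, 11], 'y': [5, 5, 3]})
df2 = pd.DataFrame({'date': ['2022-11-01', '2022-11-02'],
                    'x': [4, 11], 'y': [5, 5]})

# align df2's rows to df1 by date; dates missing from df2 become NaN rows
aligned = df1[['date']].merge(df2, on='date', how='left')
# cell-to-cell equality (NaN compares as False)
df3 = df1.eq(aligned)
# turn a whole row to False if any cell in it is False
df3 = (df3.T & df3.all(axis=1)).T
print(df3)
```

The merge is what sidesteps the "identically-labeled" error: after it, both frames have the same shape and index, so `eq` can compare them cell by cell.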

How to compare a string in one column of pandas with the rest of the columns, and if the value is found in any column of the row, append a new column?

I want to compare the Category column with all the predicted columns and, if the value matches any column in the row, append a column named rank and insert 1 if the value is found, else 0.
Use DataFrame.filter to select the predicted columns, compare them with the Category column via DataFrame.eq, convert to integers, rename the new columns with DataFrame.add_prefix, and finally add them with DataFrame.join:
df = pd.DataFrame({
    'category': list('abcabc'),
    'B': [4, 5, 4, 5, 5, 4],
    'predicted1': list('adadbd'),
    'predicted2': list('cbarac')
})
df1 = df.filter(like='predicted').eq(df['category'], axis=0).astype(int).add_prefix('new_')
df = df.join(df1)
print (df)
category B predicted1 predicted2 new_predicted1 new_predicted2
0 a 4 a c 1 0
1 b 5 d b 0 1
2 c 4 a a 0 0
3 a 5 d r 0 0
4 b 5 b a 1 0
5 c 4 d c 0 1
This solution is much less elegant than the one proposed by @jezrael, but you can try it.
# sample dataframe
d = {'cat': ['comp-el', 'el', 'comp', 'comp-el', 'el', 'comp'],
     'predicted1': ['com', 'al', 'p', 'col', 'el', 'comp'],
     'predicted2': ['a', 'el', 'p', 'n', 's', 't']}
df = pd.DataFrame(data=d)
# iterating through rows
for i, row in df.iterrows():
    # assigning values
    cat = df.loc[i, 'cat']
    predicted1 = df.loc[i, 'predicted1']
    predicted2 = df.loc[i, 'predicted2']
    # condition
    if cat == predicted1 or cat == predicted2:
        df.loc[i, 'rank'] = 1
    else:
        df.loc[i, 'rank'] = 0
output:
cat predicted1 predicted2 rank
0 comp-el com a 0.0
1 el al el 1.0
2 comp p p 0.0
3 comp-el col n 0.0
4 el el s 1.0
5 comp comp t 1.0
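The same per-row check can be vectorized with the `filter`/`eq` idiom from the first answer, collapsed into a single rank column with `any(axis=1)` (a sketch on this answer's sample data):

```python
import pandas as pd

d = {'cat': ['comp-el', 'el', 'comp', 'comp-el', 'el', 'comp'],
     'predicted1': ['com', 'al', 'p', 'col', 'el', 'comp'],
     'predicted2': ['a', 'el', 'p', 'n', 's', 't']}
df = pd.DataFrame(data=d)

# 1 if any predicted column equals 'cat' in that row, else 0
df['rank'] = df.filter(like='predicted').eq(df['cat'], axis=0).any(axis=1).astype(int)
print(df)  # rank: 0, 1, 0, 0, 1, 1
```

This avoids `iterrows` entirely, which matters once the frame has more than a few thousand rows.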

my pandas dataframe is not filterable by a column condition

I am trying to only show rows where values in column A are greater than 0. I applied the following code but I am not getting the right returned dataframe. Why?
in: df.info()
out:
A non-null int64
B non-null int64
in: df['A'] > 0
out:
A B
5 1
0 0
Obviously, the second row should NOT show. What is going on here?
The way you wrote the condition it's actually a filter (aka mask or predicate). You can take that filter and apply it to the DataFrame to get the actual rows:
In [1]: from pandas import DataFrame
In [2]: df = DataFrame({'A': range(5), 'B': ['a', 'b', 'c', 'd', 'e']})
In [3]: df
Out[3]:
A B
0 0 a
1 1 b
2 2 c
3 3 d
4 4 e
In [4]: df['A'] > 2
Out[4]:
0 False
1 False
2 False
3 True
4 True
Name: A, dtype: bool
In [5]: df[df['A'] > 2]
Out[5]:
A B
3 3 d
4 4 e
Another way to do the same thing is to use query():
In [6]: df.query('A > 2')
Out[6]:
A B
3 3 d
4 4 e

append one dataframe column value to another dataframe

I have two dataframes. df1 is an empty dataframe and df2 has some data, as shown. A few columns are common to both dfs. I want to append df2's column data into df1's columns. df3 is the expected result.
I have referred to Python + Pandas + dataframe : couldn't append one dataframe to another, but it is not working. It gives the following error:
ValueError: Plan shapes are not aligned
df1:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
df2:
c e
0 11 55
1 22 66
df3 (expected output):
   a  b   c  d   e
0        11     55
1        22     66
I tried the following, but am not getting the desired result:
import pandas as pd
l1 = ['a', 'b', 'c', 'd', 'e']
l2 = []
df1 = pd.DataFrame(l2, columns=l1)
l3 = ['c', 'e']
l4 = [[11, 55],
      [22, 66]]
df2 = pd.DataFrame(l4, columns=l3)
print("concat", "\n", pd.concat([df1, df2]))                     # column order of df1 is preserved
print("merge NaN", "\n", pd.merge(df2, df1, how='left', on=l3))  # column order is not preserved
#### Output ####
#concat
a b c d e
0 NaN NaN 11 NaN 55
1 NaN NaN 22 NaN 66
#merge
c e a b d
0 11 55 NaN NaN NaN
1 22 66 NaN NaN NaN
Append seems to work for me. Does this not do what you want?
df1 = pd.DataFrame(columns=['a', 'b', 'c'])
print("df1: ")
print(df1)
df2 = pd.DataFrame(columns=['a', 'c'], data=[[0, 1], [2, 3]])
print("df2:")
print(df2)
print("df1.append(df2):")
print(df1.append(df2, ignore_index=True, sort=False))
Output:
df1:
Empty DataFrame
Columns: [a, b, c]
Index: []
df2:
a c
0 0 1
1 2 3
df1.append(df2):
a b c
0 0 NaN 1
1 2 NaN 3
Have you tried pd.concat ?
pd.concat([df1,df2])
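Note that `DataFrame.append` was removed in pandas 2.0, so on current pandas the `pd.concat` route still runs, or you can simply reindex df2 onto df1's columns; a minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c', 'd', 'e'])
df2 = pd.DataFrame([[11, 55], [22, 66]], columns=['c', 'e'])

# give df2 the full column set; missing columns are filled with NaN
df3 = df2.reindex(columns=df1.columns)
print(df3)
```

`reindex` keeps the original order of df1's columns, which is what the merge-based attempt above lost.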

pandas Dataframe, assign value based on selection of other rows

I have a pandas DataFrame in python 3.
In this DataFrame there are rows which have identical values in two columns (this can be whole sections), I'll call this a group.
Each row also has a True/False value in a column.
Now for each row I want to know if any of the rows in its group have a False value; if so, I want to assign False to every row of that group in another column. I've managed to do this with a for-loop, but it's quite slow:
import pandas as pd
import numpy as np
df = pd.DataFrame({'E': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
                   'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})
Which gives:
df:
A B C D E
0 0 aa True 0 NaN
1 0 aa True 1 NaN
2 0 aa False 2 NaN
3 0 bb False 3 NaN
4 1 cc True 4 NaN
5 1 dd True 5 NaN
6 1 dd True 6 NaN
Now I run the for-loop:
for i in df.index:
    df.loc[i, 'E'] = df[(df['A'] == df.iloc[i]['A']) & (df['B'] == df.iloc[i]['B'])]['C'].all()
which then gives the desired result:
df:
A B C D E
0 0 aa True 0 False
1 0 aa True 1 False
2 0 aa False 2 False
3 0 bb False 3 False
4 1 cc True 4 True
5 1 dd True 5 True
6 1 dd True 6 True
When running this for my entire DataFrame of ~1 million rows it takes ages. So, looking to use .apply() to avoid a for-loop, I stumbled across the following question: apply a function to a pandas Dataframe whose returned value is based on other rows
however:
def f(x):
    return False not in x

df.groupby(['A', 'B']).C.apply(f)
returns:
A  B
0  aa    False
   bb     True
1  cc     True
   dd     True
Does anyone know a better way or how to fix the last case?
You could try doing a SQL-style join using pd.merge.
Perform the same groupby you're doing, but apply min() to it: with booleans, the group minimum is True only when every C in the group is True. Then convert that to a DataFrame, rename the column to "E", and merge it back into df.
df = pd.DataFrame({'D': [0, 1, 2, 3, 4, 5, 6],
                   'C': [True, True, False, False, True, True, True],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'A': [0, 0, 0, 0, 1, 1, 1]})
falses = pd.DataFrame(df.groupby(['A', 'B']).C.min() == True)
falses = falses.rename(columns={'C': 'E'})
df = df.merge(falses, left_on=['A', 'B'], right_index=True)
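On recent pandas the same per-group broadcast can be done in one line with `groupby(...).transform('all')`, which maps the group result back onto every row directly (a sketch, not the merge approach above):

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 1, 1, 1],
                   'B': ['aa', 'aa', 'aa', 'bb', 'cc', 'dd', 'dd'],
                   'C': [True, True, False, False, True, True, True],
                   'D': [0, 1, 2, 3, 4, 5, 6]})

# E is True only for rows whose (A, B) group has no False in C
df['E'] = df.groupby(['A', 'B'])['C'].transform('all')
print(df)
```

Because `transform` returns a result aligned to the original index, no rename/merge round-trip is needed, and it stays vectorized for the ~1 million row case.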
