Python: Compare two pandas DataFrames with an unequal number of rows - python-3.x

I need to compare two pandas DataFrames with an unequal number of rows and generate a new DataFrame with True for matching records and False for non-matching and missing records.
df1:
date x y
0 2022-11-01 4 5
1 2022-11-02 12 5
2 2022-11-03 11 3
df2:
date x y
0 2022-11-01 4 5
1 2022-11-02 11 5
expected df_output:
date x y
0 True True True
1 False False False
2 False False False
Code:
df1 = pd.DataFrame({'date':['2022-11-01', '2022-11-02', '2022-11-03'],'x':[4,12,11],'y':[5,5,3]})
df2 = pd.DataFrame({'date':['2022-11-01', '2022-11-02'],'x':[4,11],'y':[5,5]})
df_output = pd.DataFrame(np.where(df1 == df2, True, False), columns=df1.columns)
print(df_output)
Error: ValueError: Can only compare identically-labeled DataFrame objects

Direct comparison with == requires identically labeled DataFrames, which is exactly what the ValueError says. You can instead align df2 to df1 by date and use eq:
# cell-to-cell equality, comparing by date
df3 = df1.eq(df1[['date']].merge(df2, on='date', how='left'))
# or to compare by index
# df3 = df1.eq(df2, axis=1)
# if you also want to turn a whole row to False when it contains any False
df3 = (df3.T & df3.all(axis=1)).T
Output:
date x y
0 True True True
1 False False False
2 False False False
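To see why this avoids the ValueError, inspect the intermediate frame: the merge reindexes df2 to df1's rows, so both frames are identically labeled, and the cells for the missing date become NaN (which compare as False):
print(df1[['date']].merge(df2, on='date', how='left'))
         date     x    y
0  2022-11-01   4.0  5.0
1  2022-11-02  11.0  5.0
2  2022-11-03   NaN  NaN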

Related

I need to count, for each row of one dataframe, whether its values match the corresponding columns of the other dataframe.

For example, I have two dataframes:
df = pd.DataFrame([{'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'}, {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}])
df2 = pd.DataFrame([{'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}])
The idea is to count, for each row of df, whether its values match the corresponding columns of df2.
df:
   id      A      B   C
0   1  False  False  NA
1   2   True  False  NA
df2:
      A      B   C     D
0  True  False  NA  True
Output:
   id      A      B   C  Count
0   1  False  False  NA      0
1   2   True  False  NA      1
I tried something like:
for i in range(len(action_value_counts_df.columns)):
    x = action_value_counts_df.columns[i]
    if compare_column.equals(action_value_counts_df[x]):
        print(x, 'Matched')
    else:
        print(x, 'Not Matched')
This code did not help
1. Merge df and df2 on the overlapping columns.
2. Count the matches on the overlapping columns.
3. Merge the count result DataFrame back into df.
4. Replace NA with 0 in the 'Count' column.
import pandas as pd
df = pd.DataFrame([
    {'id': 1, 'A': 'False', 'B': 'False', 'C': 'NA'},
    {'id': 2, 'A': 'True', 'B': 'False', 'C': 'NA'}])
df2 = pd.DataFrame([
    {'A': 'True', 'B': 'False', 'C': 'NA', 'D': 'True'}])
# 1
match_df = pd.merge(left=df, right=df2, on=['A', 'B', 'C'], how='inner')
match_df = match_df.assign(Count=1)
"""
id A B C D Count
0 2 True False NA True 1
"""
# 2
match_count_df = match_df.groupby(['A', 'B', 'C'], as_index=False).count()
match_count_df = match_count_df[['A', 'B', 'C', 'Count']]
"""
A B C Count
0 True False NA 1
"""
# 3
result_df = pd.merge(left=df, right=match_count_df, how='left')
"""
id A B C Count
0 1 False False NA NaN
1 2 True False NA 1.0
"""
# 4
result_df.loc[:, 'Count'] = result_df['Count'].fillna(0)
"""
id A B C Count
0 1 False False NA 0.0
1 2 True False NA 1.0
"""
You can also compare the two dataframes rowwise, aligning the columns from df to df2 and ignoring NaN values (as they are not comparable):
df.assign(Count=df.set_index('id').apply(lambda x: (x.dropna() == df2[x.index].squeeze().dropna()).all() * 1, axis=1).values)
id A B C Count
0 1 False False NaN 0
1 2 True False NaN 1
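Written out step by step, the same rowwise comparison looks like this (a sketch, assuming real NaN in the C column rather than the string 'NA'):
import pandas as pd
import numpy as np

df = pd.DataFrame([{'id': 1, 'A': False, 'B': False, 'C': np.nan},
                   {'id': 2, 'A': True, 'B': False, 'C': np.nan}])
df2 = pd.DataFrame([{'A': True, 'B': False, 'C': np.nan, 'D': True}])

row2 = df2.squeeze()  # the single row of df2 as a Series

def matches(row):
    common = row.dropna()  # drop NaN cells, they are not comparable
    return int((common == row2[common.index]).all())

df['Count'] = df.set_index('id').apply(matches, axis=1).values
print(df)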

I need to match each unique row of one dataset against the matching columns of another dataset and produce a dataframe with the result

Below is the dataframe example, where id is the index.
df:
   id      A      B     C
0   1  False  False    NA
1   2   True  False    NA
2   3  False   True  True
df2:
       A      B      C      D
0   True  False     NA   True
1  False   True  False  False
2  False   True   True   True
3  False   True   True   True
4  False   True   True   True
5  False   True   True   True
6  False   True   True   True
7  False   True   True   True
Output:
For each id of df, count how many rows of df2 match its values on the shared columns (ignoring the D column of df2), and return that sum per id in a dataframe with the same index:
   id      A      B     C  Sum of matched true values in columns of df2
0   1  False  False    NA                                             0
1   2   True  False    NA                                             2
2   3  False   True  True                                             6
match_df = try_df.merge(df, on=list_new, how='outer', suffixes=('', '_y'))
match_df.drop(match_df.filter(regex='_y$').columns, axis=1, inplace=True)
df_grouped = match_df.groupby('CIS Sub Controls')[list_new].agg(['sum', 'count'])
df_final = pd.concat([df_grouped['col1']['sum'], df_grouped['col2']['sum'],
                      df_grouped['col3']['sum'], df_grouped['col4']['sum'],
                      df_grouped['col1']['count'], df_grouped['col2']['count'],
                      df_grouped['col3']['count'], df_grouped['col4']['count']],
                     axis=1).join(df_grouped.index)
This is not working as intended.
You can use value_counts and merge. value_counts counts how many times each (A, B, C) combination occurs in df2, and the left merge attaches that count to the matching rows of df:
cols = df.columns.intersection(df2.columns)
out = (df.merge(df2[cols].value_counts(dropna=False).reset_index(name='sum'),
                how='left')
         .fillna({'sum': 0}, downcast='infer')
       )
Output:
id A B C sum
0 1 False False NaN 0
1 2 True False NaN 1
2 3 False True True 6
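A runnable sketch with the data above (assuming real NaN for the NA cells):
import pandas as pd
import numpy as np

df = pd.DataFrame({'id': [1, 2, 3],
                   'A': [False, True, False],
                   'B': [False, False, True],
                   'C': [np.nan, np.nan, True]})
df2 = pd.DataFrame([(True, False, np.nan, True),
                    (False, True, False, False)]
                   + [(False, True, True, True)] * 6,
                   columns=['A', 'B', 'C', 'D'])

cols = df.columns.intersection(df2.columns)  # A, B, C -- D is ignored
counts = df2[cols].value_counts(dropna=False).reset_index(name='sum')
out = df.merge(counts, how='left')  # rows of df with no match get NaN
out['sum'] = out['sum'].fillna(0).astype(int)
print(out)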

Update a pandas dataframe

I have a pandas dataframe with multiple columns, and I have to update a column with True or False based on a condition. For example, the column names are price and result: if the price column has 'promotion' as its value, then the result column should be set to True, otherwise False.
Please help me with this.
Given this df:
price result
0 promotion 0
1 1 0
2 4 0
3 3 0
You can do so (with numpy imported as np):
import numpy as np
df['result'] = np.where(df['price'] == 'promotion', True, False)
Output:
price result
0 promotion True
1 1 False
2 4 False
3 3 False
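Since df['price'] == 'promotion' already returns booleans, the np.where wrapper is redundant; an equivalent one-liner:
df['result'] = df['price'] == 'promotion'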
Lets suppose the dataframe looks like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion False
4 3 False
5 promotion False
You can create two boolean arrays: the first has True at the indices where the result column should be set to True, and the second has True at the indices where it should be set to False.
Here is the code:
index_true = (df['price'] == 'promotion')
index_false = (df['price'] != 'promotion')
df.loc[index_true, 'result'] = True
df.loc[index_false, 'result'] = False
The resultant dataframe will look like this:
price result
0 0 False
1 1 False
2 2 False
3 promotion True
4 3 False
5 promotion True

Delete rows based on first N columns

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'Amount': [1, 0, 0], 'Amount1': [1, 0, 0], 'Ted': [1, 0, 0],
                   'date': ['2017-12-31', '2018-02-01', '2018-03-01'],
                   'type': ['Asset', 'Asset', 'Asset']})
df
df
I want to delete rows where the first three columns are 0. I don't want to use the name of the column as it changes. In this case, I want to delete the 2nd and 3rd rows.
Use boolean indexing:
df = df[df.iloc[:, :3].ne(0).any(axis=1)]
#alternative solution with inverting mask by ~
#df = df[~df.iloc[:, :3].eq(0).all(axis=1)]
print (df)
Amount Amount1 Ted date type
0 1 1 1 2017-12-31 Asset
Detail:
First select N columns by iloc:
print (df.iloc[:, :3])
Amount Amount1 Ted
0 1 1 1
1 0 0 0
2 0 0 0
Compare by ne (!=):
print (df.iloc[:, :3].ne(0))
Amount Amount1 Ted
0 True True True
1 False False False
2 False False False
Get all rows at least one True per row by any:
print (df.iloc[:, :3].ne(0).any(axis=1))
0 True
1 False
2 False
dtype: bool
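The commented alternative works the same way, just inverted: eq(0) marks the zero cells, all(axis=1) flags rows where all of the first three columns are zero, and ~ keeps the remaining rows, producing the identical mask:
print (~df.iloc[:, :3].eq(0).all(axis=1))
0     True
1    False
2    False
dtype: bool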

How to use the "na_values='?'" option in the pd.read_csv() function?

I am trying to understand how the na_values='?' option of the pd.read_csv() function works,
so that I can find the rows containing the "?" value and then remove them.
Sample:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv')
df = pd.read_csv(StringIO(temp))
print (df)
id col1 col2 col3
0 1 13? 15 14
1 1 13 15 ?
2 1 12 15 13
3 2 ? 15 ?
4 2 18 15 13
5 2 18? 15 13
If you want to remove values containing ?, whether standing alone or as a substring, build a mask with str.contains and then check whether each row has at least one True with DataFrame.any:
print (df.astype(str).apply(lambda x: x.str.contains('?', regex=False)))
id col1 col2 col3
0 False True False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False True False False
m = ~df.astype(str).apply(lambda x: x.str.contains('?', regex=False)).any(axis=1)
print (m)
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
2 1 12 15 13
4 2 18 15 13
If you want to match only cells that are exactly ?, simply compare the values:
print (df.astype(str) == '?')
id col1 col2 col3
0 False False False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False False False False
m = ~(df.astype(str) == '?').any(axis=1)
print (m)
0 True
1 False
2 True
3 False
4 True
5 True
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
0 1 13? 15 14
2 1 12 15 13
4 2 18 15 13
5 2 18? 15 13
To replace every standalone ? with NaN, the na_values parameter is what you need; then use dropna if you want to remove all rows with NaNs:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv', na_values='?')
df = pd.read_csv(StringIO(temp), na_values='?')
print (df)
id col1 col2 col3
0 1 13? 15 14.0
1 1 13 15 NaN
2 1 12 15 13.0
3 2 NaN 15 NaN
4 2 18 15 13.0
5 2 18? 15 13.0
df = df.dropna()
print (df)
id col1 col2 col3
0 1 13? 15 14.0
2 1 12 15 13.0
4 2 18 15 13.0
5 2 18? 15 13.0
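Note that na_values only matches whole fields, which is why the embedded 13? and 18? survive here; combine it with the str.contains mask above if those rows should be removed as well.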
Create a list of the values that should be treated as missing and pass it when reading the file:
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('some-data.csv', na_values=na_values)
"??" or "####" type of junk values can be converted into missing value, since in python all the blank values can be replaced with nan. Hence you can also replace these type of junk value to missing value by passing them as as list to the parameter
'na_values'.
data_csv = pd.read_csv('test.csv',na_values = ["??"])
If you want to remove the rows which contain "?" in a pandas dataframe, you can try the following:
suppose you have df:
import pandas as pd
df = pd.read_csv('test.csv')
df:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
3 test?dsfsa 9/15/2016
check whether column A contains "?" to generate the new df1:
df1 = df[~df.A.str.contains("?", regex=False)]
df1 will be:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
which will give you the new df1 which doesn't contain "?".
