return columns where dataframes differ in values - python-3.x

I have two dataframes like the df1 and df2 examples below. I would like to compare values between the dataframes and return the columns where the dataframes have different values. So in the example below it would return column B. Any tips are greatly appreciated.
df1
A B C
1 2 3
1 1 1
df2
A B C
1 1 3
1 1 1

Comparing dataframes with != or ne() returns a boolean dataframe, on which you can look for any True values using any(). This returns a boolean series which you can index with itself.
s = (df1 != df2).any()
s[s].index
Index(['B'], dtype='object')
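For reference, a self-contained version of the example above, reconstructing df1 and df2 from the question:

import pandas as pd

df1 = pd.DataFrame({'A': [1, 1], 'B': [2, 1], 'C': [3, 1]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [1, 1], 'C': [3, 1]})

# Elementwise comparison, reduced per column with any()
s = (df1 != df2).any()

# Columns that differ in at least one row
print(s[s].index)  # Index(['B'], dtype='object')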

In your example above, using eq with all:
df1.eq(df2).all().loc[lambda x : ~x].index
Out[720]: Index(['B'], dtype='object')

Related

How to create a generalized function in Python to compare consecutive columns in a pandas df for more than 100 columns

I have two data frames, DF1 and DF2, with more than 280 columns in each. I have to compare both dataframes on some unique key, so I am merging them first:
Compare = pd.merge(df1, df2, how='outer', on='unique_key', suffixes=('_X','_Y'))
Now I want to compare consecutive columns like
Compare['compare_1'] = Compare['col_X'] == Compare['col_Y']
But since there are more than 280 columns, I can't create a compare column for each pair individually, so I am looking for a function which can compare these consecutive columns.
I tried something like this,
col = df.columns
for x, i in enumerate(col):
    for y, j in enumerate(col):
        if y - x == 1 and i != j:
            bina = df[i] - df[j]
            df['MOM_' + str(j) + '_' + str(i)] = bina
But it is not working well, as my dataframes are huge (more than 100k records) and the nested loops make it slow.
IIUC use:
df1 = Compare.filter(like='_X')
df1.columns = df1.columns.str.replace('_X$', '', regex=True)
df2 = Compare.filter(like='_Y')
df2.columns = df2.columns.str.replace('_Y$', '', regex=True)
out = Compare.join(df1.sub(df2).add_prefix('MOM_'))
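A minimal sketch of the same idea, using hypothetical columns a_X, b_X, a_Y, b_Y and a unique_key column (names invented for illustration, not from the question):

import pandas as pd

Compare = pd.DataFrame({
    'unique_key': [1, 2],
    'a_X': [10, 20], 'b_X': [1, 2],
    'a_Y': [9, 20],  'b_Y': [1, 5],
})

left = Compare.filter(like='_X')
left.columns = left.columns.str.replace('_X$', '', regex=True)
right = Compare.filter(like='_Y')
right.columns = right.columns.str.replace('_Y$', '', regex=True)

# One difference column per original column pair, e.g. MOM_a, MOM_b
out = Compare.join(left.sub(right).add_prefix('MOM_'))
print(out)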

How to compare two different dataframes in Pandas and set the corresponding values?

I have two dataframes of different sizes:
df1:
Name All
L_LV-SWB_1 10.300053
L_SWB_1-SWB_2 6.494196
L_SWB_2-SWB_3 4.738036
df2:
I want to create a new column in df2 called 'Isc' which contains the numerical values of df1, only if the 'On Element' column in df2 matches the 'Name' column in df1. Otherwise a 0 value is set.
First, rename the df1 column so both dataframes share a common key column:
df1.columns = ['On element','All']
print(df1)
On element All
0 L_LV-SWB_1 10.300053
1 L_SWB_1-SWB_2 6.494196
2 L_SWB_2-SWB_3 4.738036
You just need to merge on that column ('On element') as the key.
df = pd.merge(df2,df1,on='On element', how='left').fillna(0)
print(df)
Name On element All
0 CB_1-1 L_LV-SWB_1 10.300053
1 CB_1-2 L_SWB_1-SWB_2 6.494196
2 CB_2 L_LV-SWB_1 10.300053
3 CB_2-1 L_SWB_1-SWB_3 0
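A self-contained sketch of the rename-and-merge approach; df2 below is reconstructed from the merged output above, and the final rename of 'All' to 'Isc' is an addition to match the wording of the question, not part of the answer:

import pandas as pd

df1 = pd.DataFrame({
    'Name': ['L_LV-SWB_1', 'L_SWB_1-SWB_2', 'L_SWB_2-SWB_3'],
    'All': [10.300053, 6.494196, 4.738036],
})
df2 = pd.DataFrame({
    'Name': ['CB_1-1', 'CB_1-2', 'CB_2', 'CB_2-1'],
    'On element': ['L_LV-SWB_1', 'L_SWB_1-SWB_2', 'L_LV-SWB_1', 'L_SWB_1-SWB_3'],
})

# Rename df1's key column to match df2, left-merge on it, fill misses with 0
df1 = df1.rename(columns={'Name': 'On element'})
out = pd.merge(df2, df1, on='On element', how='left').fillna(0)
out = out.rename(columns={'All': 'Isc'})
print(out)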

Is there a way to compare the values of a Pandas DataFrame with the values of a second DataFrame?

I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(note that DF2 values at indexes 4 and 2, "bar" and "fin", are the final parts of DF1 indexes 1 and 4)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the Python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would need to do the same thing in the end.
Thanks for your patience, I'm new to Python...
You will need to join your keywords in df2 first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print (df["col1"].str.contains("|".join(df2["col1"])))
0 True
1 False
2 False
3 True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True
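Since the question asks specifically about values that end with a df2 entry, a slightly stricter variant (a sketch, not part of the answers above) anchors the pattern at the end of the string and escapes the keywords:

import re
import pandas as pd

df = pd.DataFrame({'col1': ['foobar', 'acksyn', 'foobaz', 'ackfin']})
df2 = pd.DataFrame({'col1': ['old', 'fin', 'new', 'bar']})

# Match only at the end of the string; re.escape guards against regex metacharacters
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'
print(df[~df['col1'].str.contains(pattern)])
#      col1
# 1  acksyn
# 2  foobaz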

Split rows with same ID into different columns python

I want to have a dataframe with repeated values with the same id number, but I want to split the repeated rows into columns.
data = [[10450015, 4.4], [16690019, 4.1], [16690019, 4.0], [16510069, 3.7]]
df = pd.DataFrame(data, columns = ['id', 'k'])
print(df)
The resulting dataframe would have columns n_k (where n counts the repeated rows of an id). Each repeated id gets an individual column, and when an id has no repeated rows it gets a 0 in the new column.
data_merged = {'id': [10450015, 16690019, 16510069], '1_k': [4.4, 4.1, 3.7], '2_k': [0, 4.0, 0]}
print(data_merged)
Try assigning a column index reference using DataFrame.assign and groupby.cumcount, then DataFrame.pivot_table. Finally use a list comprehension to rename the column names:
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
            .pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new)
          1_k  2_k
id
10450015  4.4  0.0
16510069  3.7  0.0
16690019  4.1  4.0
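Putting it together as a self-contained sketch (the reset_index at the end turns id back into a column, matching the data_merged dict in the question; it is an addition to the answer above):

import pandas as pd

data = [[10450015, 4.4], [16690019, 4.1], [16690019, 4.0], [16510069, 3.7]]
df = pd.DataFrame(data, columns=['id', 'k'])

# Number the repeats of each id (1, 2, ...) and spread them across columns
df_new = (df.assign(col=df.groupby('id').cumcount().add(1))
            .pivot_table(index='id', columns='col', values='k', fill_value=0))
df_new.columns = [f"{x}_k" for x in df_new.columns]
print(df_new.reset_index())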

How to drop single-valued columns efficiently from a dataframe

How do I drop all columns which have a single value from a dataframe efficiently?
I found two ways:
This method ignores nulls and only considers other values; I need to consider nulls in my case:
# apply countDistinct on each column
col_counts = partsDF.agg(*(countDistinct(col(c)).alias(c) for c in partsDF.columns)).collect()[0].asDict()
This method takes too long:
col_counts = {c: partsDF.select(c).distinct().count() for c in partsDF.columns}
# select the columns with count == 1 in a list
cols_to_drop = [c for c in partsDF.columns if col_counts[c] == 1]
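One way to keep the single-pass aggregation while still treating null as a value is to add 1 to countDistinct whenever the column contains any null. A sketch, assuming partsDF already exists as in the question:

from pyspark.sql import functions as F

# Distinct non-null values per column, plus 1 if the column contains any null
agg_exprs = [
    (F.countDistinct(F.col(c)) + F.max(F.col(c).isNull().cast("int"))).alias(c)
    for c in partsDF.columns
]
col_counts = partsDF.agg(*agg_exprs).collect()[0].asDict()

# Drop every column that ends up with only one distinct value (nulls included)
cols_to_drop = [c for c in partsDF.columns if col_counts[c] <= 1]
partsDF = partsDF.drop(*cols_to_drop)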
