I have 3 dataframes (df1, df2, df3), of which df3 may or may not have been created. If df3 exists, merge all three; otherwise merge only df1 and df2.
I am trying the code below:
df1 = pd.DataFrame([['a', 1, 2], ['b', 4, 5], ['c', 7, 8], [np.nan, 10, 11]], columns=['id', 'x', 'y'])
df2 = pd.DataFrame([['a', 1, 2], ['b', 4, 5], ['c', 7, 10], [np.nan, 10, 11]], columns=['id', 'x', 'y'])
df3 = pd.DataFrame([['g', 1, 2], ['h', 4, 5], ['i', 7, 10], [np.nan, 10, 11]], columns=['id', 'x', 'y'])
if not isinstance(df3, type(None)):
    df = df1.append(df2).append(df3)
else:
    df = df1.append(df2)
It gives me a "NameError: name 'df3' is not defined" error if df3 does not exist.
This answer might have the key you're looking for: https://stackoverflow.com/a/1592578
df = df1.append(df2)
try:
    df = df.append(df3)
except NameError:
    pass  # df3 does not exist
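Note that DataFrame.append was deprecated and later removed in pandas 2.0, so the same try/except idea can be written with pd.concat instead. A minimal sketch with small stand-in frames, covering the case where df3 was never created:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([['a', 1, 2], ['b', 4, 5]], columns=['id', 'x', 'y'])
df2 = pd.DataFrame([['c', 7, 8], [np.nan, 10, 11]], columns=['id', 'x', 'y'])

frames = [df1, df2]
try:
    frames.append(df3)  # raises NameError here if df3 was never created
except NameError:
    pass

combined = pd.concat(frames, ignore_index=True)
print(len(combined))  # 4 when df3 does not exist
```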
I need help with the below requirement. This is just sample data; in the real use case each data frame has more than 200 columns. I need to compare two data frames and flag the differences.
df1
id,name,city
1,abc,pune
2,xyz,noida
df2
id,name,city
1,abc,pune
2,xyz,bangalore
3,kk,mumbai
expected dataframe
id,name,city,flag
1,abc,pune,same
2,xyz,bangalore,update
3,kk,mumbai,new
Can someone please help me build this logic in PySpark?
Thanks in advance.
PySpark's hash function can help identify the records that are different.
https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.functions.hash.html
from pyspark.sql.functions import col, hash

df1 = df1.withColumn('hash_value', hash('id', 'name', 'city'))
df2 = df2.withColumn('hash_value', hash('id', 'name', 'city'))

df_updates = df1.alias('a').join(
    df2.alias('b'),
    (col('a.id') == col('b.id')) & (col('a.hash_value') != col('b.hash_value')),
    how='inner'
)
df_updates = df_updates.select('b.*')
Once you have identified the records that are different, you can set up a function that loops through each column in the df and compares that column's values.
Something like this should work:
from pyspark.sql.functions import col, when

def add_change_flags(df1, df2):
    df_joined = df1.alias('df1').join(df2.alias('df2'), 'id', how='inner')
    for column in df1.columns:
        df_joined = df_joined.withColumn(
            column + '_change_flag',
            when(col(f'df1.{column}') == col(f'df2.{column}'), True).otherwise(False)
        )
    return df_joined
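The whole same/update/new flag the asker describes can be sketched end to end. The version below uses pandas purely for local illustration: the structure (a keyed join plus a row-wise comparison) maps directly onto the PySpark join/when API, and the column names come from the sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'name': ['abc', 'xyz'], 'city': ['pune', 'noida']})
df2 = pd.DataFrame({'id': [1, 2, 3], 'name': ['abc', 'xyz', 'kk'],
                    'city': ['pune', 'bangalore', 'mumbai']})

# Left-join the new data against the old; indicator marks ids missing from df1
merged = df2.merge(df1, on='id', how='left', suffixes=('', '_old'), indicator=True)

def flag(row):
    if row['_merge'] == 'left_only':  # id not present in df1
        return 'new'
    if (row['name'], row['city']) == (row['name_old'], row['city_old']):
        return 'same'
    return 'update'

result = merged.assign(flag=merged.apply(flag, axis=1))[['id', 'name', 'city', 'flag']]
print(result)  # flags: same, update, new
```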
Say I have a dataframe df, and a directory ./ which has the following excel files inside:
path = './'
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            print(os.path.join(root, file))
            # dfs.append(read_dfs(os.path.join(root, file)))
# df = reduce(lambda left, right: pd.concat([left, right], axis = 0), dfs)
Out:
df1.xlsx,
df2.xlsx,
df3.xls
...
I want to merge df with all the files from path based on the common columns date and city. It works with the following code, but it's not concise enough, so I am asking how to improve it. Thank you.
df = pd.merge(df, df1, on = ['date', 'city'], how='left')
df = pd.merge(df, df2, on = ['date', 'city'], how='left')
df = pd.merge(df, df3, on = ['date', 'city'], how='left')
...
Reference:
pandas three-way joining multiple dataframes on columns
The following code may work:
from functools import reduce

dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city'], how='left'), dfs)
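Putting the two pieces together, here is a minimal runnable sketch of the walk-then-reduce pattern. The file reads are replaced by small in-memory frames so it runs standalone; in practice the dfs list would come from pd.read_excel over the os.walk results:

```python
from functools import reduce
import pandas as pd

df = pd.DataFrame({'date': ['2021-01-01', '2021-01-02'],
                   'city': ['pune', 'noida'], 'base': [1, 2]})
df1 = pd.DataFrame({'date': ['2021-01-01'], 'city': ['pune'], 'a': [10]})
df2 = pd.DataFrame({'date': ['2021-01-02'], 'city': ['noida'], 'b': [20]})

# In practice: dfs = [df] + [pd.read_excel(p) for p in excel_paths]
dfs = [df, df1, df2]
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city'], how='left'), dfs)
print(df_final)  # two rows; columns date, city, base, a, b
```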
When I use the df.at function without a loop it works fine and changes the data for a particular column, but it gives an error when used inside a loop.
The code is here:
import pandas as pd

data1 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Height': [5.1, 6.2, 5.1, 5.2]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Jai', 'Princi', 'Gaurav', 'Anuj'],
         'Height': [4.1, 3.4, 7.1, 9.2]}
df2 = pd.DataFrame(data2)

df3 = pd.concat([df1, df2], axis=1)
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1['Name'][i] != df2['Name'][j]:
            continue
        else:
            out = df1['Height'][i] - df2['Height'][j]
            df3.at[i, 'Height_Comparison'] = out
            break
print(df3)
The issue was occurring because of the duplicate column names ('Name', 'Height') in DataFrame df3, produced by the concat operation: concat keeps both copies of the identically named columns, and those double entries are what cause the problem.
Once I changed the column names to Name1/Height1 in df1 and Name2/Height2 in df2, the issue was resolved.
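An alternative sketch that avoids the duplicate names without renaming anything: passing keys to concat gives each source frame its own top-level column label, so the two Height columns stay distinguishable (sample values taken from the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Height': [5.1, 6.2]})
df2 = pd.DataFrame({'Name': ['Jai', 'Princi'], 'Height': [4.1, 3.4]})

# keys= builds a MultiIndex on the columns, so labels stay unique per source
df3 = pd.concat([df1, df2], axis=1, keys=['df1', 'df2'])
diff = (df3[('df1', 'Height')] - df3[('df2', 'Height')]).round(1)
print(diff.tolist())  # [1.0, 2.8]
```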
I have multiple pandas dataframes. I want to empty each dataframe like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there any way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(10):
    df = pd.DataFrame()
    df_list.append(df)

print(df_list[0].head())
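If the goal is literally one line while keeping the individual names, tuple unpacking over a generator expression re-binds all the names at once. A minimal sketch with two frames as in the question:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'b': [3, 4]})

# Re-bind both names to fresh empty frames in one statement
df1, df2 = (pd.DataFrame() for _ in range(2))
print(df1.empty, df2.empty)  # True True
```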
I want to merge two dataframes together and then delete the first one to create space in RAM.
df1 = pd.read_csv(filepath, index_col='False')
df2 = pd.read_csv(filepath, index_col='False')
df3 = pd.read_csv(filepath, index_col='False')
df4 = pd.read_csv(filepath, index_col='False')
result = df1.merge(df2, on='column1', how='left', left_index='True', copy='False')
result2 = result.merge(df3, on='column1', how='left', left_index='True', copy='False')
Ideally what I would like to do after this is delete all of df1, df2, df3 and have the result2 dataframe left.
It's better NOT to produce unnecessary DFs:
file_list = glob.glob('/path/to/file_mask*.csv')

df = pd.read_csv(file_list[0], index_col=False)
for f in file_list[1:]:
    df = df.merge(pd.read_csv(f, index_col=False), on='column1', how='left')
PS IMO you can't (at least shouldn't) mix the on and left_index parameters. Maybe you meant right_on and left_index; that would be OK.
Just use del
del df1, df2, df3, df4
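A minimal sketch of the merge-then-delete pattern, with small hypothetical frames standing in for the CSVs; gc.collect() just nudges the collector once the name bindings are gone:

```python
import gc
import pandas as pd

df1 = pd.DataFrame({'column1': [1, 2], 'v1': [10, 20]})
df2 = pd.DataFrame({'column1': [1, 2], 'v2': [30, 40]})

result = df1.merge(df2, on='column1', how='left')

# Drop the references, then ask the collector to reclaim the memory
del df1, df2
gc.collect()
print(result.shape)  # (2, 3); only the merged frame remains
```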