I want to merge two dataframes together and then delete the first one to create space in RAM.
df1 = pd.read_csv(filepath, index_col='False')
df2 = pd.read_csv(filepath, index_col='False')
df3 = pd.read_csv(filepath, index_col='False')
df4 = pd.read_csv(filepath, index_col='False')
result = df1.merge(df2, on='column1', how='left', left_index='True', copy='False')
result2 = result.merge(df3, on='column1', how='left', left_index='True', copy='False')
Ideally what I would like to do after this is delete all of df1, df2, df3 and have the result2 dataframe left.
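It's better not to create unnecessary intermediate DataFrames in the first place:
import glob
import pandas as pd
file_list = glob.glob('/path/to/file_mask*.csv')
# index_col=False must be the boolean, not the string 'False'
df = pd.read_csv(file_list[0], index_col=False)
for f in file_list[1:]:
    # the frame read here is never bound to a name, so it can be freed after the merge
    df = df.merge(pd.read_csv(f, index_col=False), on='column1', how='left')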
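PS: IMO you can't (or at least shouldn't) combine the on and left_index parameters in one merge call. Maybe you meant right_on together with left_index; that combination would be OK. Also note that index_col, left_index and copy expect the booleans True/False, not the strings 'True'/'False'.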
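As a small illustration of the left_index plus right_on combination mentioned above, the sketch below joins the index of one frame against a column of another. The two frames, the index name key and the val/other columns are made up for the example; only column1 comes from the question.

```python
import pandas as pd

# Hypothetical data: the left frame carries its key in the index,
# the right frame carries the same key in the column 'column1'.
left = pd.DataFrame({'val': [1, 2]}, index=pd.Index(['x', 'y'], name='key'))
right = pd.DataFrame({'column1': ['x', 'y', 'y'], 'other': [10, 20, 30]})

# Join the left index against right['column1']; no 'on' argument is passed.
merged = left.merge(right, left_index=True, right_on='column1', how='left')
print(merged)
```

If both frames hold the key as an ordinary column, passing on='column1' alone is enough.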
Just use del
del df1, df2, df3, df4
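A minimal sketch of the del approach with two tiny stand-in frames (the data here is invented, not the asker's CSV files). Note that del only removes the names; the memory can be reclaimed once nothing else refers to the frames, and the explicit gc.collect() call is optional.

```python
import gc

import pandas as pd

# Stand-in frames; in the question these come from pd.read_csv.
df1 = pd.DataFrame({'column1': [1, 2, 3], 'a': [10, 20, 30]})
df2 = pd.DataFrame({'column1': [1, 2], 'b': [0.1, 0.2]})

result = df1.merge(df2, on='column1', how='left')

# Drop the names of the inputs; 'result' holds its own data.
del df1, df2
# Optional: ask the garbage collector to run now rather than later.
gc.collect()

print(result)
```

Whether the operating system reports less memory in use right away depends on the allocator, so the effect may not be visible immediately.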
Related
There are three data frames, df_1, df_2 and df_3. I combined them as follows
result1 = df_1.append(df_2,ignore_index=True)
result2 = result1.append(df_3,ignore_index=True)
Then result2 is the combined dataframe. This code segment currently works fine if none of these three input data frames is empty.
However, in practice, any of these three input data frames can be empty. What is the most efficient approach to handle these different scenarios without implementing complex if-else logic to evaluate different scenarios, e.g., df_1 is empty, or both df_1 and df_3 are empty, etc.
IIUC, use concat with a list of DataFrames; it works whether all, some, or none of the DataFrames are empty:
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
df_3 = pd.DataFrame()
df = pd.concat([df_1, df_2, df_3],ignore_index=True)
print (df)
Empty DataFrame
Columns: []
Index: []
df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a':[1,2]})
df_3 = pd.DataFrame({'a':[10,20]})
df = pd.concat([df_1, df_2, df_3],ignore_index=True)
print(df)
    a
0   1
1   2
2  10
3  20
df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a':[1,2]})
df_3 = pd.DataFrame()
df = pd.concat([df_1, df_2, df_3],ignore_index=True)
print(df)
   a
0  1
1  2
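If you would rather drop the empty frames explicitly before concatenating, a small helper along these lines is one option. This is only a sketch: concat_non_empty is a made-up name, and filtering also discards any columns that exist only in an empty frame, which may or may not be what you want.

```python
import pandas as pd


def concat_non_empty(frames):
    """Concatenate only the frames that contain rows (hypothetical helper)."""
    non_empty = [f for f in frames if not f.empty]
    if not non_empty:
        # pd.concat raises on an empty list, so return an empty frame instead.
        return pd.DataFrame()
    return pd.concat(non_empty, ignore_index=True)


df_1 = pd.DataFrame()
df_2 = pd.DataFrame({'a': [1, 2]})
df_3 = pd.DataFrame({'a': [10, 20]})

combined = concat_non_empty([df_1, df_2, df_3])
nothing = concat_non_empty([pd.DataFrame(), pd.DataFrame()])
print(combined)
```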
Say I have a dataframe df, and a directory ./ which has the following excel files inside:
path = './'
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            print(os.path.join(root, file))
            # dfs.append(read_dfs(os.path.join(root, file)))
# df = reduce(lambda left, right: pd.concat([left, right], axis = 0), dfs)
Out:
df1.xlsx,
df2.xlsx,
df3.xls
...
I want to merge df with all files from path based on the common columns date and city. The following code works, but it is not concise enough.
How can I improve it? Thank you.
df = pd.merge(df, df1, on = ['date', 'city'], how='left')
df = pd.merge(df, df2, on = ['date', 'city'], how='left')
df = pd.merge(df, df3, on = ['date', 'city'], how='left')
...
Reference:
pandas three-way joining multiple dataframes on columns
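The following code may work:
from functools import reduce
import pandas as pd
dfs = [df0, df1, df2, dfN]
# how='left' matches the merges in the question (the default would be an inner join)
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city'], how='left'), dfs)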
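To make the reduce idea concrete, here is a small self-contained sketch with invented data. merge_all is a hypothetical helper name, and the x/y/z columns are made up; in the question's setting the list of extra frames would presumably be built by calling pd.read_excel on each file found by os.walk.

```python
from functools import reduce

import pandas as pd


def merge_all(base, others, keys=('date', 'city')):
    """Left-merge every frame in 'others' onto 'base' (hypothetical helper)."""
    return reduce(
        lambda left, right: left.merge(right, on=list(keys), how='left'),
        others,
        base,  # starting value, so an empty 'others' simply returns 'base'
    )


# Invented sample data standing in for df and the Excel files.
base = pd.DataFrame({'date': ['2020-01-01', '2020-01-02'],
                     'city': ['A', 'B'],
                     'x': [1, 2]})
extra1 = pd.DataFrame({'date': ['2020-01-01'], 'city': ['A'], 'y': [10]})
extra2 = pd.DataFrame({'date': ['2020-01-02'], 'city': ['B'], 'z': [20]})

merged = merge_all(base, [extra1, extra2])
print(merged)
```

If several files share non-key column names, merge adds suffixes to the later ones, so it may be worth renaming the columns before merging.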
I have multiple pandas DataFrames. I want to empty each of them, like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there any way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(0, 10):
    df = pd.DataFrame()
    df_list.append(df)
print(df_list[0].head())
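For a literal one-liner, tuple unpacking or a dict comprehension are two possible shapes; this is a sketch, and the names below are just examples. One thing to watch: multiplying a list repeats the same object, so it does not create independent frames.

```python
import pandas as pd

# One line, three separate empty frames bound to individual names.
df1, df2, df3 = (pd.DataFrame() for _ in range(3))

# Or keep them in a dict keyed by name.
frames = {name: pd.DataFrame() for name in ('df1', 'df2', 'df3')}

# Caution: this list holds three references to ONE frame, not three frames.
aliased = [pd.DataFrame()] * 3

print(df1 is df2)                 # separate objects
print(aliased[0] is aliased[1])   # the same object
```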
I have 3 dataframes (df1, df2, df3), out of which 'df3' might be created or not. If the dataframe df3 is created then merge all the three else just merge df1 & df2.
I am trying the below code:
df1 = pd.DataFrame([['a',1,2],['b',4,5],['c',7,8],[np.NaN,10,11]], columns=['id','x','y'])
df2 = pd.DataFrame([['a',1,2],['b',4,5],['c',7,10],[np.NaN,10,11]], columns=['id','x','y'])
df3 = pd.DataFrame([['g',1,2],['h',4,5],['i',7,10],[np.NaN,10,11]], columns=['id','x','y'])
if not isinstance(df3, type(None)):
    df1.append(df2)
else:
    df1.append(df2).append(df3)
This gives me "NameError: name 'df3' is not defined" if df3 does not exist.
This answer might have the key you're looking for: https://stackoverflow.com/a/1592578
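# pd.concat returns a new DataFrame (DataFrame.append was removed in pandas 2.0)
result = pd.concat([df1, df2])
try:
    result = pd.concat([result, df3])
except NameError:
    pass  # df3 does not exist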
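An alternative that avoids catching NameError is to make sure the name always exists, for instance by initialising df3 to None before the step that may or may not create it. This is a sketch of that pattern with invented data, not the original poster's pipeline.

```python
import pandas as pd

df1 = pd.DataFrame({'id': ['a', 'b'], 'x': [1, 4]})
df2 = pd.DataFrame({'id': ['c', 'd'], 'x': [7, 10]})

# Assumption: df3 starts as None and is replaced by a DataFrame
# only when the optional upstream step actually runs.
df3 = None

# Keep whichever frames exist, then concatenate them in one call.
available = [f for f in (df1, df2, df3) if f is not None]
combined = pd.concat(available, ignore_index=True)
print(combined)
```

The same list-based call works unchanged once df3 holds a real DataFrame.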
I need to join 5 data frames using the same key. I created several temporary data frames while doing the join. The code below works fine, but I am wondering whether there is a more elegant way to achieve this goal. Thanks!
df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')
df3 = pd.read_pickle('df3.pkl')
df4 = pd.read_pickle('df4.pkl')
df5 = pd.read_pickle('df5.pkl')
tmp_1 = pd.merge(df1, df2, how ='outer', on = ['id','week'])
tmp_2 = pd.merge(tmp_1, df3, how ='outer', on = ['id','week'])
tmp_3 = pd.merge(tmp_2, df4, how ='outer', on = ['id','week'])
result_df = pd.merge(tmp_3, df5, how ='outer', on = ['id','week'])
Use pd.concat after setting the index:
dfs = [df1, df2, df3, df4, df5]
cols = ['id', 'week']
df = pd.concat([d.set_index(cols) for d in dfs], axis=1).reset_index()
To include the file reading:
from glob import glob
def rp(f):
    return pd.read_pickle(f).set_index(['id', 'week'])
# sorted() keeps the column order stable, since glob does not guarantee an order
df = pd.concat([rp(f) for f in sorted(glob('df[1-5].pkl'))], axis=1).reset_index()
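As a quick check that the concat approach gives the same rows as the chained outer merges, here is a sketch on two tiny invented frames. It assumes each frame has unique (id, week) pairs and distinct value columns; with duplicate keys, merge produces row combinations while concat along axis=1 is not expected to align the frames.

```python
import pandas as pd

keys = ['id', 'week']

# Invented sample frames with unique (id, week) pairs.
a = pd.DataFrame({'id': [1, 2], 'week': [1, 1], 'a': [0.1, 0.2]})
b = pd.DataFrame({'id': [2, 3], 'week': [1, 1], 'b': [5, 6]})

# Approach from the answer: align on the index, then restore the key columns.
via_concat = pd.concat([d.set_index(keys) for d in (a, b)], axis=1).reset_index()

# Approach from the question: outer merge on the key columns.
via_merge = a.merge(b, how='outer', on=keys)

# Put both results in the same row order before comparing.
via_concat = via_concat.sort_values(keys).reset_index(drop=True)
via_merge = via_merge.sort_values(keys).reset_index(drop=True)
print(via_concat)
```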