Merge based on multiple columns of all excel files from a directory in Python - python-3.x

Say I have a dataframe df, and a directory ./ which has the following excel files inside:
import os
path = './'
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            print(os.path.join(root, file))
            # dfs.append(read_dfs(os.path.join(root, file)))
# df = reduce(lambda left, right: pd.concat([left, right], axis = 0), dfs)
Out:
df1.xlsx,
df2.xlsx,
df3.xls
...
I want to merge df with every file under path on the common columns date and city. The following code works, but it is not concise.
How can I improve it? Thank you.
df = pd.merge(df, df1, on = ['date', 'city'], how='left')
df = pd.merge(df, df2, on = ['date', 'city'], how='left')
df = pd.merge(df, df3, on = ['date', 'city'], how='left')
...
Reference:
pandas three-way joining multiple dataframes on columns

The following code may work:
from functools import reduce
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city'], how='left'), dfs)
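Putting the directory walk and the reduce together, a minimal sketch might look like the following. The helper names find_excel_files and merge_all and the tiny demo frames are made up for illustration, and how='left' is assumed because that is what the question uses.

```python
import os
from functools import reduce

import pandas as pd


def find_excel_files(path):
    """Return the sorted paths of all .xls/.xlsx files under path."""
    found = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            if name.endswith((".xls", ".xlsx")):
                found.append(os.path.join(root, name))
    return sorted(found)


def merge_all(base, frames, keys=("date", "city")):
    """Left-merge every frame in frames onto base on the key columns."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=list(keys), how="left"),
        frames,
        base,  # the initial value, so base is always the left-most table
    )


# In-memory demo. With real files the list would instead be built as
# frames = [pd.read_excel(p) for p in find_excel_files("./")]
df = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "city": ["bj", "sh"]})
df1 = pd.DataFrame({"date": ["2020-01-01"], "city": ["bj"], "price": [10]})
df2 = pd.DataFrame({"date": ["2020-01-02"], "city": ["sh"], "sales": [7]})
merged = merge_all(df, [df1, df2])
```

If the files share non-key column names, pandas appends _x/_y suffixes to them, so renaming those columns per file before merging is probably safer.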

Related

Extracting Data From Pandas DataFrame

I have two pandas dataframes named df1 and df2, each with a single column of file paths. I want to take the file names from df1, find the matching entries in df2 (df2 has more files than df1), and put the matched paths in two columns of one dataframe. The common key is the alphanumeric token starting with the letter s (for example s537061, originally highlighted in bold). Both dataframes have to be matched on that token.
df1["Text_File_Location"] =
0 /home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt
1 /home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt
df2["Image_File_Location"]=
0 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg
1 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
In Python 3.4+, you can use pathlib to handily work with filepaths. You can extract the filename without extension ("stem") from df1 and then you can extract the parent folder name from df2. Then, you can do an inner merge on those names.
import pandas as pd
from pathlib import Path
df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)
df3 = pd.merge(df1, df2, on="name", how="inner")
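As an alternative to pathlib, the key could probably also be pulled out with a vectorized regular expression. This is only a sketch: the pattern assumes the key is always "s" followed by digits and sits either in a directory name or directly before ".txt", and the image paths below are shortened, made-up stand-ins for the real ones.

```python
import pandas as pd

df1 = pd.DataFrame({"Text_File_Location": [
    "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
    "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
]})
# Shortened, hypothetical image paths with the same layout as in the question.
df2 = pd.DataFrame({"Image_File_Location": [
    "/img/files/p10/p10000032/s537061/a.jpg",
    "/img/files/p10/p10000032/s586955/b.jpg",
    "/img/files/p10/p10000032/foo/bar.jpg",
]})

# Capture "s<digits>" when it is a path component or the stem of a .txt file.
pattern = r"/(s\d+)(?:/|\.txt$)"
df1["name"] = df1["Text_File_Location"].str.extract(pattern, expand=False)
df2["name"] = df2["Image_File_Location"].str.extract(pattern, expand=False)

# Rows without a key (NaN) are dropped before the inner merge.
matched = df1.merge(df2.dropna(subset=["name"]), on="name", how="inner")
```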

Better way to swap column values and then append them in a pandas dataframe?

Here is my dataframe:
import pandas as pd
data = {'from': ['Frida', 'Frida', 'Frida', 'Pablo', 'Pablo'],
        'to': ['Vincent', 'Pablo', 'Andy', 'Vincent', 'Andy'],
        'score': [2, 2, 1, 1, 1]}
df = pd.DataFrame(data)
df
I want to swap the values in columns 'from' and 'to' and append the swapped rows, because these scores work both ways. Here is what I have tried:
df_copy = df.copy()
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = pd.concat([df, df_copy])  # DataFrame.append was removed in pandas 2.0
This works, but is there a shorter way to do the same?
One line could be:
df_final = pd.concat([df, df.rename(columns={"from":"to","to":"from"})])
You are on the right track. Note that df.copy() already defaults to deep=True, so df_copy is independent of df; passing deep=True only makes that explicit.
df_copy = df.copy(deep=True)
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = pd.concat([df, df_copy])
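For current pandas versions, where DataFrame.append no longer exists, a minimal sketch with pd.concat could look like this. Using ignore_index=True is an assumption here, made so that the result gets a fresh 0..n-1 index instead of repeated labels.

```python
import pandas as pd

df = pd.DataFrame({
    "from": ["Frida", "Frida", "Frida", "Pablo", "Pablo"],
    "to": ["Vincent", "Pablo", "Andy", "Vincent", "Andy"],
    "score": [2, 2, 1, 1, 1],
})

# rename() returns a new frame, so df itself is left unchanged.
swapped = df.rename(columns={"from": "to", "to": "from"})

# concat aligns columns by name, so the swapped values land in the right columns.
df_final = pd.concat([df, swapped], ignore_index=True)
```

The first five rows of df_final are the original pairs and the last five are the same pairs with 'from' and 'to' exchanged.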

looping through list of pandas dataframes and make it empty dataframe

I have multiple pandas dataframes. I want to empty each of them, like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there a way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(0, 10):
    df = pd.DataFrame()
    df_list.append(df)
print(df_list[0].head())
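If the goal really is a single line, tuple unpacking or a dict comprehension might be closer to what was asked. This is a sketch only; the dict keys are hypothetical names.

```python
import pandas as pd

# Tuple unpacking: each name is bound to its own, separate empty DataFrame.
df1, df2, df3 = (pd.DataFrame() for _ in range(3))

# Or keep the frames in a dict keyed by a (hypothetical) name.
frames = {name: pd.DataFrame() for name in ("sales", "stock", "prices")}
```

Note that writing df1 = df2 = pd.DataFrame() would bind both names to the same object, which is usually not what is wanted.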

Pandas: Join multiple data frame on the same keys

I need to join 5 data frames on the same key. I created several temporary data frames while doing the join. The code below works fine, but I am wondering whether there is a more elegant way to achieve this. Thanks!
df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')
df3 = pd.read_pickle('df3.pkl')
df4 = pd.read_pickle('df4.pkl')
df5 = pd.read_pickle('df5.pkl')
tmp_1 = pd.merge(df1, df2, how ='outer', on = ['id','week'])
tmp_2 = pd.merge(tmp_1, df3, how ='outer', on = ['id','week'])
tmp_3 = pd.merge(tmp_2, df4, how ='outer', on = ['id','week'])
result_df = pd.merge(tmp_3, df5, how ='outer', on = ['id','week'])
Use pd.concat after setting the index
dfs = [df1, df2, df3, df4, df5]
cols = ['id', 'week']
df = pd.concat([d.set_index(cols) for d in dfs], axis=1).reset_index()
To include the file reading:
from glob import glob
def rp(f):
    return pd.read_pickle(f).set_index(['id', 'week'])
df = pd.concat([rp(f) for f in glob('df[1-5].pkl')], axis=1).reset_index()
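To see what the concat approach produces, here is a small self-contained sketch with three made-up frames in place of the pickles. It assumes each (id, week) pair appears at most once per frame and that the value columns have distinct names.

```python
import pandas as pd

keys = ["id", "week"]

# Hypothetical stand-ins for the pickled frames.
a = pd.DataFrame({"id": [1, 2], "week": [1, 1], "x": [10, 20]})
b = pd.DataFrame({"id": [2, 3], "week": [1, 1], "y": [5, 6]})
c = pd.DataFrame({"id": [1], "week": [1], "z": [0.5]})

# Outer-join all frames on the shared key by aligning on a common index.
joined = pd.concat([d.set_index(keys) for d in (a, b, c)], axis=1).reset_index()
```

Keys missing from a frame show up as NaN in that frame's columns, which matches the how='outer' behaviour of the chained merges in the question.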

Drop previous pandas tables after merged into 1

I want to merge two dataframes together and then delete the first one to create space in RAM.
df1 = pd.read_csv(filepath, index_col=False)
df2 = pd.read_csv(filepath, index_col=False)
df3 = pd.read_csv(filepath, index_col=False)
df4 = pd.read_csv(filepath, index_col=False)
result = df1.merge(df2, on='column1', how='left', left_index=True, copy=False)
result2 = result.merge(df3, on='column1', how='left', left_index=True, copy=False)
Ideally what I would like to do after this is delete all of df1, df2, df3 and have the result2 dataframe left.
It's better NOT to produce unnecessary DFs:
import glob
import pandas as pd
file_list = glob.glob('/path/to/file_mask*.csv')
df = pd.read_csv(file_list[0], index_col=False)
for f in file_list[1:]:
    df = df.merge(pd.read_csv(f, index_col=False), on='column1', how='left')
PS: you can't mix the on and left_index parameters; pandas raises a MergeError for that combination. Maybe you meant right_on together with left_index, which would be OK.
Just use del
del df1, df2, df3, df4
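A short sketch of the del approach, with small made-up frames in place of the CSV files. Keep in mind that del only removes the names; the memory can be reclaimed once nothing else references the objects, and the explicit gc.collect() call is optional.

```python
import gc

import pandas as pd

# Hypothetical stand-ins for the frames read from CSV.
df1 = pd.DataFrame({"column1": [1, 2], "a": [10, 20]})
df2 = pd.DataFrame({"column1": [1, 2], "b": [0.1, 0.2]})

result = df1.merge(df2, on="column1", how="left")

# Drop the names of the inputs; result is a separate object and stays usable.
del df1, df2
gc.collect()  # optional: ask the garbage collector to run right away
```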
