Say I have a dataframe df, and a directory ./ which has the following excel files inside:
import os

path = './'
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            print(os.path.join(root, file))
            # dfs.append(read_dfs(os.path.join(root, file)))
            # df = reduce(lambda left, right: pd.concat([left, right], axis=0), dfs)
Out:
df1.xlsx,
df2.xlsx,
df3.xls
...
I want to merge df with all the files from path on the common columns date and city. The following code works, but it is not concise. How can it be improved?
df = pd.merge(df, df1, on=['date', 'city'], how='left')
df = pd.merge(df, df2, on=['date', 'city'], how='left')
df = pd.merge(df, df3, on=['date', 'city'], how='left')
...
Reference:
pandas three-way joining multiple dataframes on columns
The following code may work:
from functools import reduce
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city']), dfs)
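Putting the directory walk and the reduce together, a minimal sketch could look like the following (merge_all and load_excels are illustrative helper names; pd.read_excel needs an engine such as openpyxl installed, and the key columns date and city are taken from the question):

```python
import os
from functools import reduce
import pandas as pd

def merge_all(base, frames, keys=('date', 'city')):
    """Left-merge every frame in `frames` onto `base` using the shared key columns."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=list(keys), how='left'),
        frames,
        base,
    )

def load_excels(path='./'):
    """Collect every .xls/.xlsx file under `path` as a DataFrame."""
    return [pd.read_excel(os.path.join(root, f))
            for root, dirs, files in os.walk(path)
            for f in files
            if f.endswith(('.xls', '.xlsx'))]

# df = merge_all(df, load_excels('./'))
```

Using reduce with an initial value (`base`) keeps the original df as the left side of every merge, so its rows are preserved as with repeated how='left' merges.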
Related
I have two pandas dataframes named df1 and df2. I want to match same-named files across the two dataframes and put the extracted pairs into two columns of a new dataframe. I take the file names from df1 and match them against df2 (df2 has more files than df1). Each dataframe has a single column. The common matching key is the alphanumeric component that starts with the letter s (shown in bold); both dataframes have to be matched on that.
df1["Text_File_Location"] =
0 /home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt
1 /home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt
df2["Image_File_Location"]=
0 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg
1 /media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg
In Python 3.4+, you can use pathlib to handily work with filepaths. You can extract the filename without extension ("stem") from df1 and then you can extract the parent folder name from df2. Then, you can do an inner merge on those names.
import pandas as pd
from pathlib import Path
df1 = pd.DataFrame(
    {
        "Text_File_Location": [
            "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt",
            "/home/mzkhan/2.0.0/files/p15/p15098455/s586955.txt",
        ]
    }
)
df2 = pd.DataFrame(
    {
        "Image_File_Location": [
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s586955/174413ec-4ec4c1f7-34ea26b7-c5f994f8-79ef1962.jpg",
            "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/foo/bar.jpg",
        ]
    }
)
df1["name"] = df1["Text_File_Location"].apply(lambda x: Path(str(x)).stem)
df2["name"] = df2["Image_File_Location"].apply(lambda x: Path(str(x)).parent.name)
df3 = pd.merge(df1, df2, on="name", how="inner")
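To see what .stem and .parent.name extract from the two path styles above (paths copied from the question):

```python
from pathlib import Path

text_path = "/home/mzkhan/2.0.0/files/p15/p15546261/s537061.txt"
image_path = "/media/mzkhan/external_dr/radiology_image/2.0.0/files/p10/p10000032/s537061/02aa804e- bde0afdd-112c0b34-7bc16630-4e384014.jpg"

# .stem is the final path component with its extension stripped
print(Path(text_path).stem)          # s537061

# .parent.name is the immediately enclosing folder's name
print(Path(image_path).parent.name)  # s537061
```

Both expressions yield the same s-prefixed key, which is what makes the inner merge line up.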
Here is my dataframe:
import pandas as pd
data = {'from': ['Frida', 'Frida', 'Frida', 'Pablo', 'Pablo'],
        'to': ['Vincent', 'Pablo', 'Andy', 'Vincent', 'Andy'],
        'score': [2, 2, 1, 1, 1]}
df = pd.DataFrame(data)
df
I want to swap the values in columns 'from' and 'to' and append the swapped rows, because these scores work both ways. Here is what I have tried:
df_copy = df.copy()
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = df.append(df_copy)
which works but is there a shorter way to do the same?
One line could be :
df_final = df.append(df.rename(columns={"from":"to","to":"from"}))
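Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same one-liner is written with pd.concat (a sketch using the data from the question):

```python
import pandas as pd

data = {'from': ['Frida', 'Frida', 'Frida', 'Pablo', 'Pablo'],
        'to': ['Vincent', 'Pablo', 'Andy', 'Vincent', 'Andy'],
        'score': [2, 2, 1, 1, 1]}
df = pd.DataFrame(data)

# pd.concat replaces the removed DataFrame.append; ignore_index renumbers the rows
df_final = pd.concat(
    [df, df.rename(columns={"from": "to", "to": "from"})],
    ignore_index=True,
)
print(len(df_final))  # 10
```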
You are on the right track. Note that df.copy() already returns a deep copy: deep=True is the default, so renaming the copy does not modify the original df. Passing deep=True explicitly just makes that intent obvious.
df_copy = df.copy(deep=True)
df_copy.rename(columns={"from":"to","to":"from"}, inplace=True)
df_final = df.append(df_copy)
I have multiple pandas dataframes. I want to empty each dataframe like below:
df1 = pd.DataFrame()
df2 = pd.DataFrame()
Instead of doing it individually, is there any way to do it in one line of code?
If I understood correctly, this will work:
df_list = []
for i in range(10):
    df = pd.DataFrame()
    df_list.append(df)
print(df_list[0].head())
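The loop above collapses to a list comprehension; and if the goal is to reset a fixed set of named variables, tuple unpacking does it in one statement (variable names here are illustrative):

```python
import pandas as pd

# One line: a list of ten fresh, independent empty DataFrames
df_list = [pd.DataFrame() for _ in range(10)]

# Or reset several named dataframes in a single statement
df1, df2, df3 = (pd.DataFrame() for _ in range(3))
```

Avoid `df1 = df2 = pd.DataFrame()`, which binds both names to the *same* object.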
I need to join 5 data frames using the same key. I created several temporary data frames while doing the joins. The code below works fine, but I am wondering whether there is a more elegant way to achieve this. Thanks!
df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')
df3 = pd.read_pickle('df3.pkl')
df4 = pd.read_pickle('df4.pkl')
df5 = pd.read_pickle('df5.pkl')
tmp_1 = pd.merge(df1, df2, how='outer', on=['id', 'week'])
tmp_2 = pd.merge(tmp_1, df3, how='outer', on=['id', 'week'])
tmp_3 = pd.merge(tmp_2, df4, how='outer', on=['id', 'week'])
result_df = pd.merge(tmp_3, df5, how='outer', on=['id', 'week'])
Use pd.concat after setting the index
dfs = [df1, df2, df3, df4, df5]
cols = ['id', 'week']
df = pd.concat([d.set_index(cols) for d in dfs], axis=1).reset_index()
To include the file reading as well:
from glob import glob

def rp(f):
    return pd.read_pickle(f).set_index(['id', 'week'])

df = pd.concat([rp(f) for f in glob('df[1-5].pkl')], axis=1).reset_index()
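As a quick sanity check, the set_index-then-concat pattern reproduces an outer join on ['id', 'week'] (tiny made-up frames for illustration; note that concat along axis=1 requires the key pairs to be unique within each frame):

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2], 'week': [1, 1], 'a': [10, 20]})
df2 = pd.DataFrame({'id': [1, 3], 'week': [1, 2], 'b': [30, 40]})

cols = ['id', 'week']
merged = pd.concat([d.set_index(cols) for d in (df1, df2)], axis=1).reset_index()
# Each (id, week) pair appears once; unmatched cells become NaN, as with how='outer'
```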
I want to merge two dataframes together and then delete the first one to create space in RAM.
df1 = pd.read_csv(filepath, index_col=False)
df2 = pd.read_csv(filepath, index_col=False)
df3 = pd.read_csv(filepath, index_col=False)
df4 = pd.read_csv(filepath, index_col=False)
result = df1.merge(df2, on='column1', how='left', left_index=True, copy=False)
result2 = result.merge(df3, on='column1', how='left', left_index=True, copy=False)
Ideally what I would like to do after this is delete all of df1, df2, df3 and have the result2 dataframe left.
It's better NOT to produce unnecessary DFs:
import glob

file_list = glob.glob('/path/to/file_mask*.csv')
df = pd.read_csv(file_list[0], index_col=False)
for f in file_list[1:]:
    df = df.merge(pd.read_csv(f, index_col=False), on='column1', how='left')
PS: IMO you can't (or at least shouldn't) mix the on and left_index parameters. Maybe you meant right_on together with left_index; that combination would be OK.
Just use del
del df1, df2, df3, df4
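A minimal sketch of the merge-then-free pattern (toy frames, and column1 is the hypothetical key column from the question):

```python
import gc
import pandas as pd

df1 = pd.DataFrame({'column1': [1, 2], 'a': [10, 20]})
df2 = pd.DataFrame({'column1': [1, 2], 'b': [30, 40]})

result = df1.merge(df2, on='column1', how='left')

# Drop the name bindings so the source frames become eligible for garbage collection
del df1, df2
gc.collect()
```

Note that del only removes the names; memory is actually reclaimed once no other references to those frames remain.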