Pandas: Join multiple data frames on the same keys - python-3.x

I need to join 5 data frames using the same key. I created several temporary data frames while doing the joins. The code below works fine, but I am wondering whether there is a more elegant way to achieve this goal. Thanks!
df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')
df3 = pd.read_pickle('df3.pkl')
df4 = pd.read_pickle('df4.pkl')
df5 = pd.read_pickle('df5.pkl')
tmp_1 = pd.merge(df1, df2, how ='outer', on = ['id','week'])
tmp_2 = pd.merge(tmp_1, df3, how ='outer', on = ['id','week'])
tmp_3 = pd.merge(tmp_2, df4, how ='outer', on = ['id','week'])
result_df = pd.merge(tmp_3, df5, how ='outer', on = ['id','week'])

Use pd.concat after setting the index
dfs = [df1, df2, df3, df4, df5]
cols = ['id', 'week']
df = pd.concat([d.set_index(cols) for d in dfs], axis=1).reset_index()
Include file reading
from glob import glob
def rp(f):
    return pd.read_pickle(f).set_index(['id', 'week'])
df = pd.concat([rp(f) for f in glob('df[1-5].pkl')], axis=1).reset_index()
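If the frames share non-key columns and you want merge's suffix handling rather than plain index alignment, a functools.reduce over pd.merge also removes the temporaries. A minimal sketch, assuming the same df1 to df5 as above:
from functools import reduce

dfs = [df1, df2, df3, df4, df5]
# Fold the list into one chained outer merge on the shared keys.
result_df = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['id', 'week']), dfs)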

Related

Merge based on multiple columns of all excel files from a directory in Python

Say I have a dataframe df, and a directory ./ which has the following excel files inside:
path = './'
for root, dirs, files in os.walk(path):
    for file in files:
        if file.endswith(('.xls', '.xlsx')):
            print(os.path.join(root, file))
            # dfs.append(read_dfs(os.path.join(root, file)))
            # df = reduce(lambda left, right: pd.concat([left, right], axis = 0), dfs)
Out:
df1.xlsx,
df2.xlsx,
df3.xls
...
I want to merge df with all the files from path based on the common columns date and city. The code below works, but it is not concise enough, so I am asking how it could be improved. Thank you.
df = pd.merge(df, df1, on = ['date', 'city'], how='left')
df = pd.merge(df, df2, on = ['date', 'city'], how='left')
df = pd.merge(df, df3, on = ['date', 'city'], how='left')
...
Reference:
pandas three-way joining multiple dataframes on columns
The following code may work:
from functools import reduce
dfs = [df0, df1, df2, dfN]
df_final = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city']), dfs)
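Applied to this question, the directory walk and the reduce can be combined so the Excel files are read and left-merged onto df in one pass. A sketch, assuming the files sit directly under path and all contain the date and city columns:
from functools import reduce
from glob import glob

# Read every Excel file in the directory, then left-merge each onto df.
excel_dfs = [pd.read_excel(f) for f in glob('./*.xls*')]
df = reduce(lambda left, right: pd.merge(left, right, on=['date', 'city'], how='left'), [df] + excel_dfs)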

Resetting multi index for a pivot_table to get single line index

I have a dataframe in the following format:
df = pd.DataFrame({'a':['1-Jul', '2-Jul', '3-Jul', '1-Jul', '2-Jul', '3-Jul'], 'b':[1,1,1,2,2,2], 'c':[3,1,2,4,3,2]})
I need the following dataframe:
df_new = pd.DataFrame({'a':['1-Jul', '2-Jul', '3-Jul'], 1:[3, 1, 2], 2:[4,3,2]})
I have tried the following:
df = df.pivot_table(index = ['a'], columns = ['b'], values = ['c'])
df_new = df.reset_index()
but it doesn't give me the required result. I have tried variations of this to no avail. Any help will be greatly appreciated.
try this one:
df2 = df.groupby('a')['c'].agg(['first','last']).reset_index()
cols_ = df['b'].unique().tolist()
cols_.insert(0,df.columns[0])
df2.columns = cols_
df2
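An alternative that stays closer to the original pivot_table attempt: passing values='c' as a scalar rather than a list keeps the column index to a single level, so only the leftover column-axis name has to be dropped. A sketch, assuming each (a, b) pair occurs once so aggfunc='first' just passes the value through:
# Scalar values= keeps the columns flat; rename_axis drops the leftover 'b' label.
df_new = (df.pivot_table(index='a', columns='b', values='c', aggfunc='first')
            .reset_index()
            .rename_axis(None, axis=1))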

Generating multiple columns dynamically using loop in pyspark dataframe

I have a requirement where I have to generate multiple columns dynamically in pyspark. I have written code like the below to accomplish this.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext()
sqlContext = SQLContext(sc)
cols = ['a','b','c']
df = sqlContext.read.option("header","true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    df1 = df.withColumn(i, lit('hi'))
df1.show()
However, columns a and b are missing from the final result. Please help.
I changed the code as below. It is working now, but I wanted to know whether there is a better way of handling it.
cols = ['a','b','c']
cols_add = []
flg_first = 'Y'
df = sqlContext.read.option("header","true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    print('start' + str(df.columns))
    if flg_first == 'Y':
        df1 = df.withColumn(i, lit('hi'))
        cols_add.append(i)
        flg_first = 'N'
    else:
        df1 = df1.select(df.columns + cols_add).withColumn(i, lit('hi'))
        cols_add.append(i)
    print('end' + str(df1.columns))
df1.show()
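A simpler way to handle it is to reassign the result of withColumn onto the running DataFrame, so each pass keeps the columns added before it. A minimal sketch using the same cols and df as above:
from pyspark.sql.functions import lit

cols = ['a', 'b', 'c']
df1 = df
for i in cols:
    # Build on the previous result instead of starting from df each time.
    df1 = df1.withColumn(i, lit('hi'))
df1.show()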

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column in and try to loop it, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet, for code quality purposes, have the select statement use the text_col variable, so you only have to change one line of code if you need to do this with more columns or if your column names change. E.g.,
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....
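Putting that together, a sketch of the full function driven by text_col (assuming pyspark.sql.functions is imported as F, as in the question):
from pyspark.sql import functions as F

def pd(data):
    text_col = ['oproblem', 'lca']
    # Select the columns as a list, then lower-case and replace punctuation in each.
    df = data.select(text_col)
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df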

Drop previous pandas tables after merged into 1

I want to merge two dataframes together and then delete the first one to create space in RAM.
df1 = pd.read_csv(filepath, index_col='False')
df2 = pd.read_csv(filepath, index_col='False')
df3 = pd.read_csv(filepath, index_col='False')
df4 = pd.read_csv(filepath, index_col='False')
result = df1.merge(df2, on='column1', how='left', left_index='True', copy='False')
result2 = result.merge(df3, on='column1', how='left', left_index='True', copy='False')
Ideally what I would like to do after this is delete all of df1, df2, df3 and have the result2 dataframe left.
It's better NOT to produce unnecessary DFs:
import glob

file_list = glob.glob('/path/to/file_mask*.csv')
df = pd.read_csv(file_list[0], index_col='False')
for f in file_list[1:]:
    df = df.merge(pd.read_csv(f, index_col='False'), on='column1', how='left')
PS: IMO you can't (or at least shouldn't) mix the on and left_index parameters. Maybe you meant right_on and left_index; that would be OK.
Just use del
del df1, df2, df3, df4
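del only removes the names; a short sketch that also prompts the garbage collector so the memory can actually be returned:
import gc

del df1, df2, df3, df4
gc.collect()  # ask Python to release the freed objects' memory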
