Generating multiple columns dynamically using loop in pyspark dataframe - apache-spark

I have a requirement to generate multiple columns dynamically in PySpark. I have written the following code to accomplish this.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext()
sqlContext = SQLContext(sc)
cols = ['a', 'b', 'c']
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    df1 = df.withColumn(i, lit('hi'))
df1.show()
However, columns a and b are missing from the final result. Please help.
I changed the code as shown below. It's working now, but I wanted to know if there is a better way of handling it.
cols = ['a', 'b', 'c']
cols_add = []
flg_first = 'Y'
df = sqlContext.read.option("header", "true").option("delimiter", "|").csv("C:\\Users\\elkxsnk\\Desktop\\sample.csv")
for i in cols:
    print('start' + str(df.columns))
    if flg_first == 'Y':
        df1 = df.withColumn(i, lit('hi'))
        cols_add.append(i)
        flg_first = 'N'
    else:
        df1 = df1.select(df.columns + cols_add).withColumn(i, lit('hi'))
        cols_add.append(i)
    print('end' + str(df1.columns))
df1.show()
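A simpler pattern (just a sketch, not tested against the asker's exact data) is to keep building on the same dataframe variable inside the loop, or to add all of the literal columns in a single select, which avoids the first-iteration flag entirely:
from pyspark.sql.functions import lit

cols = ['a', 'b', 'c']

# Option 1: reassign the same variable so each new column is added on top of the previous ones
df1 = df
for i in cols:
    df1 = df1.withColumn(i, lit('hi'))
df1.show()

# Option 2: add all literal columns in one pass, with no loop state
df2 = df.select('*', *[lit('hi').alias(i) for i in cols])
df2.show()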

Related

Python: Import multiple dataframes using for loop

I have the following code which works to import a dataframe.
#read tblA
tbl = 'a'
cols = 'imp_a'
usecols = dfDD[dfDD[cols].notnull()][cols].values.tolist()
dfa = getdf(tbl, dfRT, sfsession)
dfa = dfa[usecols]
#read tblB
tbl = 'b'
cols = 'imp_sb'
usecols = dfDD[dfDD[cols].notnull()][cols].values.tolist()
dfb = getdf(tbl, dfRT, sfsession)
dfb = dfb[usecols]
#importing a few more tables in the steps as above two...
Is there a way to shorten this code and avoid writing the same thing multiple times? The values that change are tbl, cols, and the dataframe name (df..).
I tried a few different things including putting all the changing attributes into a dictionary, but wasn't able to make it work. I could create a function, but the function would require a few more parameters - dfDD, dfRT, sfsession. I don't think it's a great solution. There has to be a better way to write this.
The loop should be fairly simple like this -
import pandas as pd

# Create a dictionary that will store your dataframes
df_dict = {}

config = {'tblA': {'tbl': 'a', 'cols': 'imp_a'},
          'tblB': {'tbl': 'b', 'cols': 'imp_sb'}}

# Loop through the config
for key, val in config.items():
    tbl = val['tbl']
    cols = val['cols']
    usecols = dfDD[dfDD[cols].notnull()][cols].values.tolist()
    df = getdf(tbl, dfRT, sfsession)[usecols]
    df_dict[key] = df  # Store your dataframe in the dictionary
    print(f"Created dataframe for table - {key} ({tbl} | {cols})")
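With this in place, each dataframe can be pulled back out of the dictionary by its config key, e.g. (a small usage sketch using the keys above):
dfa = df_dict['tblA']
dfb = df_dict['tblB']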

Filter dataframe based on groupby sum()

I want to filter my dataframe based on a groupby sum(). I am looking for rows where the amounts for a specific date sum to zero.
I have solved this by creating a for loop. I suspect this will hurt performance if the dataframe is large.
It also seems clunky.
import pandas as pd

newdf = pd.DataFrame()
newdf['name'] = ('leon', 'eurika', 'monica', 'wian')
newdf['surname'] = ('swart', 'swart', 'swart', 'swart')
newdf['birthdate'] = ('14051981', '198001', '20081012', '20100621')
newdf['tdate'] = ('13/05/2015', '14/05/2015', '15/05/2015', '13/05/2015')
newdf['tamount'] = (100.10, 111.11, 123.45, -100.10)

df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
df2 = df.loc[df["tamount"] == 0, "tdate"]
df3 = pd.DataFrame()
for i in df2:
    df3 = df3.append(newdf.loc[newdf["tdate"] == i])
print(df3)
The code above produces as output the two rows whose tamount values sum to zero for the same date:
name surname birthdate tdate tamount
0 leon swart 1981-05-14 13/05/2015 100.1
3 wian swart 2010-06-21 13/05/2015 -100.1
Just use basic numpy :)
import numpy as np
df = newdf.groupby(['tdate'])[['tamount']].sum().reset_index()
dates = df['tdate'][np.where(df['tamount'] == 0)[0]]
newdf[np.isin(newdf['tdate'], dates) == True]
Hope this helps; let me know if you have any questions.
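As an aside (an alternative sketch, not part of the original answer), the same filter can be expressed in pandas alone with groupby().transform, which skips the intermediate list of dates:
# Keep rows whose per-date total of tamount is (numerically) zero
mask = newdf.groupby('tdate')['tamount'].transform('sum').abs() < 1e-9
print(newdf[mask])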

PySpark: Replace Punctuations with Space Looping Through Columns

I have the following code running successfully in PySpark:
from pyspark.sql import functions as F

def pd(data):
    df = data
    df = df.select('oproblem')
    text_col = ['oproblem']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
But when I add a second column in and try to loop it, it doesn't work:
def pd(data):
    df = data
    df = df.select('oproblem', 'lca')
    text_col = ['oproblem', 'lca']
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df
Below is the error I get:
TypeError: 'Column' object is not callable
I think it should be df = df.select(['oproblem', 'lca']) instead of df = df.select('oproblem', 'lca').
Better yet for code quality purposes, have the select statement use the text_columns variable, so you only have to change 1 line of code if you need to do this with more columns or if your column names change. Eg,
def pd(data):
    df = data
    text_col = ['oproblem', 'lca']
    df = df.select(text_col)
    ....
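Putting that suggestion together with the loop from the question, a complete version might look like this (a sketch; the regex and column names are copied from the question, and the function is renamed to avoid shadowing the usual pandas alias pd):
from pyspark.sql import functions as F

def clean_text(data):
    text_col = ['oproblem', 'lca']
    df = data.select(text_col)
    for i in text_col:
        df = df.withColumn(i, F.lower(F.col(i)))
        df = df.withColumn(i, F.regexp_replace(F.col(i), '[.,#-:;/?!\']', ' '))
    return df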

Pandas: Join multiple data frames on the same keys

I need to join 5 data frames using the same keys. I created several temporary data frames while doing the joins. The code below works fine, but I am wondering whether there is a more elegant way to achieve this goal. Thanks!
df1 = pd.read_pickle('df1.pkl')
df2 = pd.read_pickle('df2.pkl')
df3 = pd.read_pickle('df3.pkl')
df4 = pd.read_pickle('df4.pkl')
df5 = pd.read_pickle('df5.pkl')
tmp_1 = pd.merge(df1, df2, how ='outer', on = ['id','week'])
tmp_2 = pd.merge(tmp_1, df3, how ='outer', on = ['id','week'])
tmp_3 = pd.merge(tmp_2, df4, how ='outer', on = ['id','week'])
result_df = pd.merge(tmp_3, df5, how ='outer', on = ['id','week'])
Use pd.concat after setting the index
dfs = [df1, df2, df3, df4, df5]
cols = ['id', 'week']
df = pd.concat([d.set_index(cols) for d in dfs], axis=1).reset_index()
To include the file reading as well:
from glob import glob

def rp(f):
    return pd.read_pickle(f).set_index(['id', 'week'])

df = pd.concat([rp(f) for f in glob('df[1-5].pkl')], axis=1).reset_index()
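Another option (an alternative sketch, not from the original answer) is to fold the merges with functools.reduce, which mirrors the question's chain of pd.merge calls without the temporary names:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3, df4, df5]
result_df = reduce(
    lambda left, right: pd.merge(left, right, how='outer', on=['id', 'week']),
    dfs)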

HiveQL to PySpark - issue with aggregated column in SELECT statement

I have the following HQL script which needs to be put into PySpark (Spark 1.6):
insert into table db.temp_avg
select
a,
avg(b),
c
from db.temp WHERE flag is not null GROUP BY a, c;
I created a few versions of Spark code, but I'm struggling with how to get this averaged column into the select.
Also I found out that grouped data cannot be written this way:
df3 = df2.groupBy...
df3.write.mode('overwrite').saveAsTable('db.temp_avg')
Part of my PySpark code:
temp_table = sqlContext.table("db.temp")
df = temp_table.select('a', 'avg(b)', 'c', 'flag').toDF('a', 'avg(b)', 'c', 'flag')
df = df.where(df['flag'] != 'null')
# this of course does not work along with the avg(b)
df2 = df.groupBy('a', 'c')
df3.write.mode('overwrite').saveAsTable('db.temp_avg')
Thx for your help.
Correct solution:
import pyspark.sql.functions as F

df = sqlContext.sql("SELECT * FROM db.temp").alias("temp")
df = df.filter(F.col("temp.flag").isNotNull())\
    .select('a', 'b', 'c')\
    .groupby('a', 'c')\
    .agg(F.avg('b').alias("avg_b"))
Or equivalently, without the alias:
df = sqlContext.sql("select * from db.temp")
df = df.filter(F.col("flag").isNotNull())\
    .select('a', 'b', 'c')\
    .groupby('a', 'c')\
    .agg(F.avg('b').alias("avg_b"))
Then you can save the table with:
df.write.mode('overwrite').saveAsTable('db.temp_avg')
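Since the query already exists as HiveQL, another option (just a sketch, assuming sqlContext is a HiveContext so that INSERT INTO TABLE is supported) is to run it directly and let Spark handle the aggregation:
# Run the original HiveQL as-is against the Hive metastore tables
sqlContext.sql("""
    insert into table db.temp_avg
    select a, avg(b), c
    from db.temp
    where flag is not null
    group by a, c
""")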
