How to use groupby with array elements in Pyspark? - apache-spark

I'm running a groupBy operation on a dataframe in PySpark and need to group by a list that may contain one or two features. How can I do this?
record_fields = [['record_edu_desc'], ['record_construction_desc'], ['record_cost_grp'],
                 ['record_bsmnt_typ_grp_desc'], ['record_shape_desc'],
                 ['record_sqft_dec_grp', 'record_renter_grp_c_flag'], ['record_home_age'],
                 ['record_home_age_grp', 'record_home_age_missing']]

for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', field).sum('net_contributions')
    ### df write to csv operation
My first thought was to create a list of lists and pass it to the groupby operation, but I get the following error:
TypeError: Invalid argument, not a string or column:
['record_edu_desc'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How do I make this work? I'm open to other ways I could do this.

Try this (note the * (asterisk) before field):
for field in record_fields:
    df_group = df.groupBy('year', 'area', 'state', 'code', *field).sum('net_contributions')
Also take a look at this question to learn more about the asterisk operator in Python.
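For reference, a minimal sketch of the full loop with the unpacking applied, including a hypothetical CSV write for the step the question only hints at (out_path is a placeholder name, not from the original post):

for field in record_fields:
    df_group = (
        df.groupBy('year', 'area', 'state', 'code', *field)  # *field unpacks the list into separate column names
          .sum('net_contributions')
    )
    # write each grouped result to its own folder, named after the grouping fields
    df_group.write.mode('overwrite').csv(out_path + '/' + '_'.join(field))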

Related

How to add a column name to the dataframe storing the result of the correlation of two columns in PySpark?

I have read a CSV file and need to find the correlation between two columns.
I am using df.stat.corr('Age', 'Exp') and the result is 0.7924058156930612.
But I want this result stored in another dataframe with the header "correlation":
correlation
0.7924058156930612
Following up on what @gupta_hemant commented.
You can create a new column as
df.withColumn("correlation", df.stat.corr("Age", "Exp").collect()[0].correlation)
(I am guessing the exact syntax here, but it should be something like this)
After reviewing the code, the syntax should be
import pyspark.sql.functions as F
df.withColumn("correlation", F.lit(df.stat.corr("Age", "Exp")))
Try this and let me know.
corrValue = df.stat.corr("Age", "Exp")
newDF = spark.createDataFrame(
    [
        (corrValue,)   # note the trailing comma: a one-element tuple, not a bare float
    ],
    ["corr"]
)
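An alternative sketch that sidesteps the one-element-tuple subtlety, assuming the same spark session and dataframe, builds the row explicitly:

from pyspark.sql import Row

corrValue = df.stat.corr("Age", "Exp")
# a single Row lets Spark infer both the column name and the type
newDF = spark.createDataFrame([Row(correlation=corrValue)])
newDF.show()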

Pyspark - how to pass a column to a function after casting?

First I called the sha2 function from pyspark.sql.functions incorrectly, passing it a column of DoubleType, and got the following error:
cannot resolve 'sha2(`metric`, 256)' due to data type mismatch: argument 1 requires binary type, however, '`metric`' is of double type
Then I tried to first cast the columns to StringType, but I am still getting the same error. I am probably missing something about how column transformations are processed by Spark.
I've noticed that when I just call df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) without calling .withColumn(col_name, F.sha2(df[col_name], 256)), the column's type is changed to StringType.
How should I apply a transformation correctly in this case?
def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        df = df.withColumn(col_name, F.lit(df[col_name].cast(StringType()))) \
               .withColumn(col_name, F.sha2(df[col_name], 256))
    return df
You don't need lit here. Try:
.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
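A minimal sketch of the whole function with that fix applied, assuming the same imports as in the question (F for pyspark.sql.functions, DataFrame from pyspark.sql):

def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        # cast to string first, then hash; no lit() wrapper is needed
        df = df.withColumn(col_name, F.sha2(df[col_name].cast('string'), 256))
    return df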
I believe the issue here is the call to F.lit which creates a literal.
def parse_to_sha2(df: DataFrame, cols: list):
    for col_name in cols:
        df = df.withColumn(
            col_name,
            F.col(col_name).cast(StringType())
        ).withColumn(
            col_name,
            F.sha2(F.col(col_name), 256)
        )
    return df
This should generate a SHA value per column.
In case you need all of them combined, you would need to pass all the columns to sha, since it takes col* arguments.
Edit: the last part of that comment is not correct; only F.hash takes multiple columns as arguments, while md5, crc32 and sha2 take only one, so sorry for the confusion.
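If a single hash over several columns is what you actually need, a hedged sketch (the output column names here are illustrative) is to concatenate the values first for sha2, or use F.hash, which does accept multiple columns:

# one sha2 value per row over all the columns, by concatenating their string forms
df = df.withColumn(
    "row_sha",
    F.sha2(F.concat_ws("|", *[F.col(c).cast('string') for c in cols]), 256)
)
# or Spark's built-in multi-column hash
df = df.withColumn("row_hash", F.hash(*[F.col(c) for c in cols]))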

applying row-wise function in a data frame with values of another data frame

I have two data frames:
df1
df1 = pd.DataFrame({ 'group' : ["A","A","A","B","B","B"],
'par' : [5,5,15,10,2,11],
'val' :[50,10,180,10,10,660],
'set_0' :["Country","Country","Country","Country","Country","Country"],
'set_1' :["size1","size1","size2","size3","size4","size3"],
'set_2' :["size12","size12","size12","size9","size13","size13"],
'set_3' :["size14","size14","size15","NO","NO","NO"],
'set_4' :["NO","NO","NO","size25","size25","size27"],
'set_5' :["NO","NO","NO","NO","NO","NO"]
})
df2
df2 = pd.DataFrame({ 'group' : ["A","A","A","A","A","A","B","B","B","B","B","B","B"],
'set' : ["Country","size1","size2","size12","size14","size15","Country","size3","size4","size9","size13","size25","size27"],
})
For each row (group x set combination) of df2, I want to apply a function that calculates (sum of "val" / sum of "par").
I tried to use something with the apply function, but I could not really figure it out since I am pretty new to Python.
Could anyone please help with a solution?
As the image upload keeps failing, instead of showing the expected outcome I am sharing below the code that produces it in the ugliest and most inefficient hardcoded way:
a1=df1[(df1["group"]==df2.iloc[0,0])&(df1["set_0"]==df2.iloc[0,1])].sum()["val"]
a2=df1[(df1["group"]==df2.iloc[0,0])&(df1["set_0"]==df2.iloc[0,1])].sum()["par"]
b1=df1[(df1["group"]==df2.iloc[1,0])&(df1["set_1"]==df2.iloc[1,1])].sum()["val"]
b2=df1[(df1["group"]==df2.iloc[1,0])&(df1["set_1"]==df2.iloc[1,1])].sum()["par"]
c1=df1[(df1["group"]==df2.iloc[2,0])&(df1["set_1"]==df2.iloc[2,1])].sum()["val"]
c2=df1[(df1["group"]==df2.iloc[2,0])&(df1["set_1"]==df2.iloc[2,1])].sum()["par"]
d1=df1[(df1["group"]==df2.iloc[3,0])&(df1["set_2"]==df2.iloc[3,1])].sum()["val"]
d2=df1[(df1["group"]==df2.iloc[3,0])&(df1["set_2"]==df2.iloc[3,1])].sum()["par"]
e1=df1[(df1["group"]==df2.iloc[4,0])&(df1["set_3"]==df2.iloc[4,1])].sum()["val"]
e2=df1[(df1["group"]==df2.iloc[4,0])&(df1["set_3"]==df2.iloc[4,1])].sum()["par"]
f1=df1[(df1["group"]==df2.iloc[5,0])&(df1["set_3"]==df2.iloc[5,1])].sum()["val"]
f2=df1[(df1["group"]==df2.iloc[5,0])&(df1["set_3"]==df2.iloc[5,1])].sum()["par"]
g1=df1[(df1["group"]==df2.iloc[6,0])&(df1["set_0"]==df2.iloc[6,1])].sum()["val"]
g2=df1[(df1["group"]==df2.iloc[6,0])&(df1["set_0"]==df2.iloc[6,1])].sum()["par"]
h1=df1[(df1["group"]==df2.iloc[7,0])&(df1["set_1"]==df2.iloc[7,1])].sum()["val"]
h2=df1[(df1["group"]==df2.iloc[7,0])&(df1["set_1"]==df2.iloc[7,1])].sum()["par"]
j1=df1[(df1["group"]==df2.iloc[8,0])&(df1["set_1"]==df2.iloc[8,1])].sum()["val"]
j2=df1[(df1["group"]==df2.iloc[8,0])&(df1["set_1"]==df2.iloc[8,1])].sum()["par"]
k1=df1[(df1["group"]==df2.iloc[9,0])&(df1["set_2"]==df2.iloc[9,1])].sum()["val"]
k2=df1[(df1["group"]==df2.iloc[9,0])&(df1["set_2"]==df2.iloc[9,1])].sum()["par"]
l1=df1[(df1["group"]==df2.iloc[10,0])&(df1["set_2"]==df2.iloc[10,1])].sum()["val"]
l2=df1[(df1["group"]==df2.iloc[10,0])&(df1["set_2"]==df2.iloc[10,1])].sum()["par"]
m1=df1[(df1["group"]==df2.iloc[11,0])&(df1["set_4"]==df2.iloc[11,1])].sum()["val"]
m2=df1[(df1["group"]==df2.iloc[11,0])&(df1["set_4"]==df2.iloc[11,1])].sum()["par"]
n1=df1[(df1["group"]==df2.iloc[12,0])&(df1["set_4"]==df2.iloc[12,1])].sum()["val"]
n2=df1[(df1["group"]==df2.iloc[12,0])&(df1["set_4"]==df2.iloc[12,1])].sum()["par"]
Up to this part, I had to manually create "val" and "par" pairs for each element.
Then,
a=a1/a2
b=b1/b2
c=c1/c2
d=d1/d2
e=e1/e2
f=f1/f2
g=g1/g2
h=h1/h2
j=j1/j2
k=k1/k2
l=l1/l2
m=m1/m2
n=n1/n2
Finally, the result is:
df2["desired_calculation"]=[a,b,c,d,e,f,g,h,j,k,l,m,n]

better way to select all columns and join in pyspark data frames

I have two data frames in PySpark. Their schemas are below:
df1
DataFrame[customer_id: int, email: string, city: string, state: string, postal_code: string, serial_number: string]
df2
DataFrame[serial_number: string, model_name: string, mac_address: string]
Now I want to do a full outer join on these two data frames, using coalesce on the column common to both of them.
I have done it like below and got the expected result.
full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select(df1.customer_id, df1.email, df1.city, df1.state, df1.postal_code, f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number'), df2.model_name, df2.mac_address)
Now I want to do the above a little differently. Instead of writing all the column names near select in the join statement, I want to do something like using * on the data frame. Basically I want something like below.
full_df = df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer').select('df1.*', f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number1'), df2.model_name, df2.mac_address).drop('serial_number')
I am getting what I want. Is there a better way to do this kind of operation in PySpark?
Edit: this is not a duplicate of https://stackoverflow.com/questions/36132322/join-two-data-frames-select-all-columns-from-one-and-some-columns-from-the-othe?rq=1 because I am using coalesce in the join statement. I want to know whether there is a way to exclude the column on which I am using the coalesce function.
You can do something like this:
(df1
    .join(df2, df1.serial_number == df2.serial_number, 'full_outer')
    .select(
        [df1[c] for c in df1.columns if c != 'serial_number'] +
        [f.coalesce(df1.serial_number, df2.serial_number)]
    ))
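If you also want the coalesced column to come back under the name serial_number, a variant of the same idea (a sketch using the same DataFrames and the f alias) is:

full_df = (
    df1.join(df2, df1.serial_number == df2.serial_number, 'full_outer')
       .select(
           [df1[c] for c in df1.columns if c != 'serial_number'] +
           [f.coalesce(df1.serial_number, df2.serial_number).alias('serial_number')] +
           [df2.model_name, df2.mac_address]
       )
)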

spark: row to element

New to Spark.
I'd like to do some transformation on the "wordList" column of a Spark DataFrame, df, of type org.apache.spark.sql.DataFrame = [id: string, wordList: array<string>].
I use Databricks. df looks like:
+--------------------+--------------------+
| id| wordList|
+--------------------+--------------------+
|08b0a9b6-3b9a-47a...| [a]|
|23c2ef79-8dce-4ad...|[ag, adfg, asdfgg...|
|26a7682f-2ce6-4eb...|[ghe, gener, ghee...|
|2ab530b5-04bc-463...|[bap, pemm, pava,...|
+--------------------+--------------------+
More specifically, I have defined a function shrinkList(ol: List[String]): List[String] that takes a list and returns a shorter list, and I would like to apply it to the wordList column. The question is, how do I convert the row to a list?
df.select("wordList").map(t => shrinkList(t(1))) gives the error: type mismatch;
found : Any
required: List[String]
Also, I'm not sure about "t(1)" here. I'd rather use the column name instead of the index, in case the order of the columns changes in the future. But I can't seem to make t$"wordList" or t.wordList or t("wordList") work. So instead of using t(1), what selector can I use to select the "wordList" column?
Try:
df.select("wordList").map(t => shrinkList(t.getSeq[String](0).toList))
or
df.select("wordList").map(t => shrinkList(t.getAs[Seq[String]]("wordList").toList))
