How to concatenate data frame column values in PySpark?

I have created a data frame using the code below:
df = spark.createDataFrame([("A", "20"), ("B", "30"), ("D", "80"),("A", "120"),("c", "20"),("Null", "20")],["Let", "Num"])
df.show()
+----+---+
| Let|Num|
+----+---+
| A| 20|
| B| 30|
| D| 80|
| A|120|
| c| 20|
|Null| 20|
+----+---+
I want to create a data frame like the one below:
+----+------+
| Let|   Num|
+----+------+
|   A|20,120|
|   B|    30|
|   D|    80|
|   c|    20|
|Null|    20|
+----+------+
How can I achieve this?

You can groupBy on Let and collect Num as a list with collect_list:
from pyspark.sql import functions as F
df.groupBy("Let").agg(F.collect_list("Num")).show()
Output as a list:
+----+-----------------+
| Let|collect_list(Num)|
+----+-----------------+
| B| [30]|
| D| [80]|
| A| [20, 120]|
| c| [20]|
|Null| [20]|
+----+-----------------+
df.groupBy("Let").agg(F.concat_ws(",", F.collect_list("Num"))).show()
Output as a string:
+----+-------------------------------+
| Let|concat_ws(,, collect_list(Num))|
+----+-------------------------------+
| B| 30|
| D| 80|
| A| 20,120|
| c| 20|
|Null| 20|
+----+-------------------------------+
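Note that collect_list does not guarantee any particular ordering of the collected values, so the concatenated string may come out as 120,20 instead of 20,120 on another run. If you need a deterministic, numeric ordering, a minimal sketch (Spark 2.4+, using array_sort) would be:
from pyspark.sql import functions as F

# Cast Num to int so the sort is numeric, sort each collected list,
# then cast back to strings and concatenate with a comma.
(df.withColumn("Num", F.col("Num").cast("int"))
   .groupBy("Let")
   .agg(F.array_sort(F.collect_list("Num")).alias("Nums"))
   .withColumn("Num", F.concat_ws(",", F.col("Nums").cast("array<string>")))
   .select("Let", "Num")
   .show())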

Related

Using PySpark, determine the lead value and display it in a new column

Sl Now
1 D
2 D
3 D
4 R
5 R
6 C
7 C
8 C
9 D
10 P
11 R
12 R
13 D
I have a dataset like the one above, and I want to produce the following:
Sl Now lead
1 D R
2 D R
3 D R
4 R C
5 R C
6 C D
7 C D
8 C D
9 D P
10 P R
11 R D
12 R D
13 D
I want to add a column called "lead" that holds the value of the next group in the "Now" column, repeated for every row of the current group (so it lines up row for row with the "Now" column). Can this be done with PySpark?
Here is how I would do it.
Prep data
a = "DDDRRCCCDPRRD"
a = list(zip(range(len(a)), a))
b = ["Sl","Now"]
df = spark.createDataFrame(a,b)
df.show()
+---+---+
| Sl|now|
+---+---+
| 0| D|
| 1| D|
| 2| D|
| 3| R|
| 4| R|
| 5| C|
| 6| C|
| 7| C|
| 8| D|
| 9| P|
| 10| R|
| 11| R|
| 12| D|
+---+---+
Imports
from pyspark.sql import functions as F
from pyspark.sql import Window as W
Add an incremental group ID: flag each row whose value differs from the previous row (using lag), then take a running sum of the flags so that every consecutive run of identical values gets its own id.
df = df.withColumn("id", F.when(F.col("now") == F.lag("Now").over(W.orderBy("Sl")), 0).otherwise(1))
df = df.withColumn("id", F.sum("id").over(W.orderBy("Sl")))
df.show()
+---+---+---+
| Sl|now| id|
+---+---+---+
| 0| D| 1|
| 1| D| 1|
| 2| D| 1|
| 3| R| 2|
| 4| R| 2|
| 5| C| 3|
| 6| C| 3|
| 7| C| 3|
| 8| D| 4|
| 9| P| 5|
| 10| R| 6|
| 11| R| 6|
| 12| D| 7|
+---+---+---+
Get the lead value (method 1)
df = df.withColumn("lead", F.collect_set("now").over(W.orderBy("id").rangeBetween(1, 1)))
df = df.select("Sl", "now", F.explode_outer("lead").alias("lead"))
df.show()
+---+---+----+
| Sl|now|lead|
+---+---+----+
| 0| D| R|
| 1| D| R|
| 2| D| R|
| 3| R| C|
| 4| R| C|
| 5| C| D|
| 6| C| D|
| 7| C| D|
| 8| D| P|
| 9| P| R|
| 10| R| D|
| 11| R| D|
| 12| D|null|
+---+---+----+
Get the lead value (method 2)
df2 = df.select(F.col("now").alias("lead"), (F.col("id") - 1).alias("id")).distinct()
df1 = df.join(df2, how="left", on="id")
df1.select("Sl", "now", "lead").show()
+---+---+----+
| Sl|now|lead|
+---+---+----+
| 0| D| R|
| 1| D| R|
| 2| D| R|
| 3| R| C|
| 4| R| C|
| 5| C| D|
| 6| C| D|
| 7| C| D|
| 8| D| P|
| 9| P| R|
| 10| R| D|
| 11| R| D|
| 12| D|null|
+---+---+----+
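For what it's worth, method 2 can also be written with F.lead: take the distinct (id, now) pairs (one row per group), look one group ahead, and join back. This is just a sketch and assumes df still carries the id column built above:
# One row per group, with the next group's value attached
groups = (df.select("id", "now").distinct()
            .withColumn("lead", F.lead("now").over(W.orderBy("id"))))

(df.join(groups.select("id", "lead"), on="id", how="left")
   .select("Sl", "now", "lead")
   .orderBy("Sl")
   .show())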

How can I achieve the following Spark behaviour using the replaceWhere clause?

I want to write data to Delta tables incrementally while replacing (overwriting) the partitions already present in the sink. Example:
Consider this data inside my Delta table, already partitioned by the id column:
+---+---+
| id|  x|
+---+---+
|  1|  A|
|  2|  B|
|  3|  C|
+---+---+
Now, I would like to insert the following dataframe:
+---+---------+
| id|        x|
+---+---------+
|  2|      NEW|
|  2|      NEW|
|  4|        D|
|  5|        E|
+---+---------+
The desired output is this:
+---+---------+
| id|        x|
+---+---------+
|  1|        A|
|  2|      NEW|
|  2|      NEW|
|  3|        C|
|  4|        D|
|  5|        E|
+---+---------+
What I did is the following:
df = spark.read.format("csv").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/in/input.csv")
Ids=[x.id for x in df.select("id").distinct().collect()]
for Id in Ids:
    df.filter(df.id == Id).write.format("delta").option("mergeSchema", "true").partitionBy("id").option("replaceWhere", "id == '$i'".format(i=Id)).mode("append").save("/mnt/blob/datafinance/bronze/simba/test/res/")
spark.read.format("delta").option("sep", ";").option("header", "true").load("/mnt/blob/datafinance/bronze/simba/test/res/").show()
And this is the result:
+---+---------+
| id|        x|
+---+---------+
|  2|        B|
|  1|        A|
|  5|        E|
|  2|      NEW|
|  2|NEW AUSSI|
|  3|        C|
|  4|        D|
+---+---------+
As you can see, it appended all the values without replacing the partition id=2, which was already present in the table.
I think this is because of mode("append").
But changing it to mode("overwrite") throws the following error:
Data written out does not match replaceWhere 'id == '$i''.
Can anyone tell me how to achieve what I want, please?
Thank you.
I actually had an error in the code. I replaced
.option("replaceWhere", "id == '$i'".format(i=Id))
with
.option("replaceWhere", "id == '{i}'".format(i=Id))
and it worked.
Thanks to @ggordon, who pointed out the error to me on another question.
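As a side note, the per-partition loop can usually be avoided: with mode("overwrite"), a single replaceWhere predicate covering all incoming ids replaces exactly those partitions in one atomic write and leaves the other partitions untouched. A sketch, assuming the distinct ids fit comfortably on the driver:
# Build a predicate such as "id in ('2','4','5')" from the incoming data
ids = [row.id for row in df.select("id").distinct().collect()]
predicate = "id in ({})".format(",".join("'{}'".format(i) for i in ids))

(df.write.format("delta")
   .mode("overwrite")
   .option("replaceWhere", predicate)
   .partitionBy("id")
   .save("/mnt/blob/datafinance/bronze/simba/test/res/"))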

Pivot on two columns with both numeric and categorical values in PySpark

I have a data set in PySpark like this:
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
user_df.show()
+---+----+--------+-----+
| id|time|category|value|
+---+----+--------+-----+
| 1| 1| speed| 50|
| 1| 1| speed| 60|
| 1| 2| door| open|
| 1| 2| door| open|
| 1| 2| door|close|
| 1| 2| speed| 75|
| 2| 10| speed| 30|
| 2| 11| door| open|
| 2| 12| door| open|
| 2| 13| speed| 50|
| 2| 13| speed| 40|
+---+----+--------+-----+
What I want to get is something like below: group by id and time, pivot on category, and return the average when the value is numeric and the mode when it is categorical.
+---+----+-----+-----+
| id|time| door|speed|
+---+----+-----+-----+
|  1|   1| null|   55|
|  1|   2| open|   75|
|  2|  10| null|   30|
|  2|  11| open| null|
|  2|  12| open| null|
|  2|  13| null|   45|
+---+----+-----+-----+
I tried this, but for the categorical values it returns null (I am not worried about the nulls in the speed column):
from pyspark.sql.functions import avg

df = user_df\
    .groupBy('id', 'time')\
    .pivot('category')\
    .agg(avg('value'))\
    .orderBy(['id', 'time'])
df.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|null| 75.0|
| 2| 10|null| 30.0|
| 2| 11|null| null|
| 2| 12|null| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
You can compute a second aggregation in the pivot and coalesce the resulting columns. Try this:
import pyspark.sql.functions as F
from collections import namedtuple
user_row = namedtuple('user_row', 'id time category value'.split())
data = [
user_row(1,1,'speed','50'),
user_row(1,1,'speed','60'),
user_row(1,2,'door', 'open'),
user_row(1,2,'door','open'),
user_row(1,2,'door','close'),
user_row(1,2,'speed','75'),
user_row(2,10,'speed','30'),
user_row(2,11,'door', 'open'),
user_row(2,12,'door','open'),
user_row(2,13,'speed','50'),
user_row(2,13,'speed','40')
]
user_df = spark.createDataFrame(data)
#%%
#user_df.show()
df = user_df.groupBy('id', 'time')\
    .pivot('category')\
    .agg(F.avg('value').alias('avg'), F.max('value').alias('max'))
#%%
expr1= [x for x in df.columns if '_avg' in x]
expr2= [x for x in df.columns if 'max' in x]
expr=zip(expr1,expr2)
#%%
sel_expr= [F.coalesce(x[0],x[1]).alias(x[0].split('_')[0]) for x in expr]
#%%
df_final = df.select('id','time',*sel_expr).orderBy('id','time')
df_final.show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 1| 2|open| 75.0|
| 2| 10|null| 30.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 13|null| 45.0|
+---+----+----+-----+
Alternatively, collect the values per group and transform them as required with Spark SQL's higher-order functions (Spark 2.4+):
from pyspark.sql.functions import col, collect_list, expr

user_df.groupby('id', 'time').pivot('category').agg(collect_list('value'))\
    .select('id', 'time', col('door')[0].alias('door'),
            expr('''aggregate(speed, cast(0.0 as double), (acc, x) -> acc + x, acc -> acc/size(speed))''').alias('speed'))\
    .show()
+---+----+----+-----+
| id|time|door|speed|
+---+----+----+-----+
| 1| 1|null| 55.0|
| 2| 13|null| 45.0|
| 2| 11|open| null|
| 2| 12|open| null|
| 2| 10|null| 30.0|
| 1| 2|open| 75.0|
+---+----+----+-----+
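Both answers above stand in for the mode with max / the first collected element, which happens to be fine for this sample. If a true per-group mode is needed for the categorical column, one option is to count the values explicitly and join the two aggregations back together; the names door_mode and speed_avg below are only illustrative, and ties are broken arbitrarily:
import pyspark.sql.functions as F
from pyspark.sql import Window as W

# Most frequent door value per (id, time)
door_counts = (user_df.filter(F.col("category") == "door")
                      .groupBy("id", "time", "value").count())
w = W.partitionBy("id", "time").orderBy(F.desc("count"))
door_mode = (door_counts.withColumn("rn", F.row_number().over(w))
                        .filter("rn = 1")
                        .select("id", "time", F.col("value").alias("door")))

# Average speed per (id, time), then combine both sides
speed_avg = (user_df.filter(F.col("category") == "speed")
                    .groupBy("id", "time")
                    .agg(F.avg("value").alias("speed")))

(speed_avg.join(door_mode, ["id", "time"], "full")
          .orderBy("id", "time")
          .show())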

PySpark: fetch common data from a dataframe by comparing values of given columns

I have two pyspark dataframes like this.
Data frame A:
+-----+---+
|name1|id1|
+-----+---+
|    a|  3|
|    b|  5|
|    c|  7|
+-----+---+
Data frame B:
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
|    d|  6|
|    e|  0|
|    f|  3|
+-----+---+
I want to fetch the contents of dataframe B where the values of name1 (from dataframe A) and name2 (from dataframe B) match, as shown below.
Output dataframe:
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
+-----+---+
I want to avoid computationally expensive methods such as collect(). How can this be done in Apache Spark?
df1.join(df2, df1.name1 == df2.name2).select(df2['*']).show()
Or, using SQL:
df1.createOrReplaceTempView("tableA")
df2.createOrReplaceTempView("tableB")
result = spark.sql("select b.name2, b.id2 from tableA a join tableB b on a.name1 = b.name2")
result.show()
+-----+---+
|name2|id2|
+-----+---+
|    a| 13|
|    b| 15|
|    c| 17|
+-----+---+
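If only dataframe B's columns are needed, a left semi join is another option: it keeps the rows of df2 that have a match in df1 and never carries df1's columns along, so there is nothing to drop afterwards. A minimal sketch using the same df1 (A) and df2 (B) names as above:
df2.join(df1, df2.name2 == df1.name1, how="leftsemi").show()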

Full outer join in PySpark data frames

I have created two data frames in PySpark like below. Both data frames have an id column, and I want to perform a full outer join on them.
valuesA = [('Pirate',1),('Monkey',2),('Ninja',3),('Spaghetti',4)]
a = sqlContext.createDataFrame(valuesA,['name','id'])
a.show()
+---------+---+
| name| id|
+---------+---+
| Pirate| 1|
| Monkey| 2|
| Ninja| 3|
|Spaghetti| 4|
+---------+---+
valuesB = [('dave',1),('Thor',2),('face',3), ('test',5)]
b = sqlContext.createDataFrame(valuesB,['Movie','id'])
b.show()
+-----+---+
|Movie| id|
+-----+---+
| dave| 1|
| Thor| 2|
| face| 3|
| test| 5|
+-----+---+
full_outer_join = a.join(b, a.id == b.id,how='full')
full_outer_join.show()
+---------+----+-----+----+
| name| id|Movie| id|
+---------+----+-----+----+
| Pirate| 1| dave| 1|
| Monkey| 2| Thor| 2|
| Ninja| 3| face| 3|
|Spaghetti| 4| null|null|
| null|null| test| 5|
+---------+----+-----+----+
I want to have a result like below when I do a full outer join:
+---------+-----+----+
| name|Movie| id|
+---------+-----+----+
| Pirate| dave| 1|
| Monkey| Thor| 2|
| Ninja| face| 3|
|Spaghetti| null| 4|
| null| test| 5|
+---------+-----+----+
I have done it like below, but I am getting a different result:
full_outer_join = a.join(b, a.id == b.id,how='full').select(a.id, a.name, b.Movie)
full_outer_join.show()
+---------+----+-----+
| name| id|Movie|
+---------+----+-----+
| Pirate| 1| dave|
| Monkey| 2| Thor|
| Ninja| 3| face|
|Spaghetti| 4| null|
| null|null| test|
+---------+----+-----+
As you can see, I am missing id 5 in my result data frame.
How can I achieve what I want?
Since the join columns have the same name, you can specify the join columns as a list:
a.join(b, ['id'], how='full').show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
Or coalesce the two id columns:
import pyspark.sql.functions as F
a.join(b, a.id == b.id, how='full').select(
F.coalesce(a.id, b.id).alias('id'), a.name, b.Movie
).show()
+---+---------+-----+
| id| name|Movie|
+---+---------+-----+
| 5| null| test|
| 1| Pirate| dave|
| 3| Ninja| face|
| 2| Monkey| Thor|
| 4|Spaghetti| null|
+---+---------+-----+
You can either rename the id column of dataframe b and drop it afterwards (a sketch of that variant follows the output below), or use a list in the join condition:
a.join(b, ['id'], how='full')
Output:
+---+---------+-----+
|id |name |Movie|
+---+---------+-----+
|1 |Pirate |dave |
|3 |Ninja |face |
|5 |null |test |
|4 |Spaghetti|null |
|2 |Monkey |Thor |
+---+---------+-----+
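For completeness, here is a minimal sketch of the rename-and-drop variant mentioned above (the name id_b is just an illustrative choice):
import pyspark.sql.functions as F

# Rename b's id so the two id columns don't collide, join, then coalesce and drop
b2 = b.withColumnRenamed("id", "id_b")

(a.join(b2, a.id == b2.id_b, how="full")
  .withColumn("id", F.coalesce("id", "id_b"))  # keep whichever side is non-null
  .drop("id_b")
  .select("name", "Movie", "id")
  .show())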
