How to get the key of a single dataframe in Spark joins - apache-spark

Suppose I have 2 datasets like below.
Book
case class Book(book_name: String, cost: Int, writer_id:Int)
val bookDS = Seq(
Book("Scala", 400, 1),
Book("Spark", 500, 2),
Book("Kafka", 300, 3),
Book("Java", 350, 5)
).toDS()
bookDS.show()
Writer
case class Writer(writer_name: String, writer_id:Int)
val writerDS = Seq(
Writer("Martin",1),
Writer("Zaharia " 2),
Writer("Neha", 3),
Writer("James", 4)
).toDS()
writerDS.show()
When I inner join them, writer_id is returned twice.
How can I get writer_id from only one dataset?
I don't want to write SQL like select a.something, b.something.

writerDS.join(bookDS, Seq("writer_id")).show()
Output:
+---------+-----------+---------+----+
|writer_id|writer_name|book_name|cost|
+---------+-----------+---------+----+
| 1| Martin| Scala| 400|
| 2| Zaharia| Spark| 500|
| 3| Neha| Kafka| 300|
+---------+-----------+---------+----+

When we join two datasets, all columns from both datasets will be present in the result dataset.
So you can rename the column and then drop one of the two.
// Rename the duplicate join key before the join, then drop the renamed column
// (needs: import static org.apache.spark.sql.functions.col;)
Dataset<Row> joinedDataset = bookDS
    .withColumnRenamed("writer_id", "book_writer_id")
    .join(writerDS, col("book_writer_id").equalTo(col("writer_id")), "inner")
    .drop("book_writer_id");
Not sure if you are using Python or Scala.
This is Java code, so please convert it accordingly.
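In PySpark, a minimal sketch of the same rename-and-drop approach could look like the following (my illustration, not from the original answer; book_df and writer_df are hypothetical dataframes mirroring the datasets above):

from pyspark.sql import functions as F

# Hypothetical PySpark version of the rename-and-drop approach
joined_df = (book_df
             .withColumnRenamed("writer_id", "book_writer_id")
             .join(writer_df, F.col("book_writer_id") == F.col("writer_id"), "inner")
             .drop("book_writer_id"))
joined_df.show()

Joining on a list of column names, as the question's own writerDS.join(bookDS, Seq("writer_id")) already does, likewise keeps only a single writer_id column.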

Related

spark date_format results showing null

I have data source like below:
order_id,order_date,order_customer_id,order_status
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
I am trying to convert to mm/dd/yyyy only for CLOSED orders using the queries below, but I am getting null output. Can you please assist with getting the required date format using the DSL or Spark SQL method:
closed_df=ord_df.select(date_format(to_date('order_date','yyyy-mm-dd hh:mm:SS.a'),'mm/dd/yyyy') .\
alias("formate_date")).show()
#output:
+------------+
|formate_date|
+------------+
|        null|
|        null|
+------------+
ord_df.createOrReplaceTempView("orders")
cld_df = spark.sql( """select order_id, date_format(to_date("order_date","yyyy-mm-dd hh:mm:ss.a"),'mm/dd/yyyy') as order_date,\
order_customer_id, order_status \
from orders where order_status = 'CLOSED'""").show()
#output:
+--------+----------+-----------------+------------+
|order_id|order_date|order_customer_id|order_status|
+--------+----------+-----------------+------------+
|       1|      null|            11599|      CLOSED|
|       4|      null|             8827|      CLOSED|
+--------+----------+-----------------+------------+
The format to parse the string 2013-07-25 00:00:00.0 is yyyy-MM-dd HH:mm:ss.S (note the uppercase MM for month and HH for hour). Likewise, for date formatting the pattern is MM/dd/yyyy. See the Spark datetime pattern documentation for more information.
data = [(1, "2013-07-25 00:00:00.0", 11599, "CLOSED",),
(2, "2013-07-25 00:00:00.0", 256, "PENDING_PAYMENT",),
(3, "2013-07-25 00:00:00.0", 12111, "COMPLETE",),
(4, "2013-07-25 00:00:00.0", 8827, "CLOSED",), ]
ord_df = spark.createDataFrame(data, ("order_id", "order_date", "order_customer_id", "order_status",))
from pyspark.sql.functions import to_date, date_format
closed_df = (ord_df.where("order_status = 'CLOSED'")
             .select(date_format(to_date('order_date', 'yyyy-MM-dd HH:mm:ss.S'), 'MM/dd/yyyy')
                     .alias("formate_date"))).show()
"""
+------------+
|formate_date|
+------------+
| 07/25/2013|
| 07/25/2013|
+------------+
"""
ord_df.createOrReplaceTempView("orders")
cld_df = spark.sql("""select order_id, date_format(to_date(order_date, "yyyy-MM-dd HH:mm:ss.S"), "MM/dd/yyyy") as order_date,
                      order_customer_id, order_status
                      from orders where order_status = 'CLOSED'""").show()
"""
+--------+----------+-----------------+------------+
|order_id|order_date|order_customer_id|order_status|
+--------+----------+-----------------+------------+
| 1|07/25/2013| 11599| CLOSED|
| 4|07/25/2013| 8827| CLOSED|
+--------+----------+-----------------+------------+
"""

Pyspark - filter, groupby, aggregate for different combinations of columns and functions

I have a simple operation to do in Pyspark but I need to run the operation with many different parameters. It is just filter on one column, then groupby a different column, and aggregate on a third column. In Python, the function is:
def filter_gby_reduce(df, filter_col=None, filter_value=None):
    return df.filter(col(filter_col) == filter_value).groupby('ID').agg(max('Value'))
Let's say the different configurations are
func_params = spark.createDataFrame([('Day', 'Monday'), ('Month', 'January')], ['feature', 'filter_value'])
I could of course just run the functions one by one:
filter_gby_reduce(df, filter_col = 'Day', filter_value = 'Monday')
filter_gby_reduce(df, filter_col = 'Month', filter_value = 'January')
But my actual collection of parameters is much larger. Lastly, I also need to union all of the function results together into one dataframe. So is there a way in spark to write this more succinctly and in a way that will fully take advantage of parallelization?
One way of doing this is by generating the desired values as columns using when and max and passing these to agg. As you want the values unioned, you have to unpivot the result using stack (there is no DataFrame API for that, so a selectExpr is used). Depending on your dataset you might get null if a filter excludes all data; these rows can be dropped if needed.
I'd recommend testing this against the 'naive' approach of simply unioning a large number of filtered dataframes (a sketch of that naive approach follows the output below).
import pyspark.sql.functions as f

func_params = [('Day', 'Monday'), ('Month', 'January')]

df = spark.createDataFrame([
    ('Monday', 'June', 1, 5),
    ('Monday', 'January', 1, 2),
    ('Monday', 'June', 1, 5),
    ('Monday', 'June', 2, 10)],
    ['Day', 'Month', 'ID', 'Value'])

cols = []
for column, flt in func_params:
    name = f'{column}_{flt}'
    val = f.when(f.col(column) == flt, f.col('Value')).otherwise(None)
    cols.append(f.max(val).alias(name))

stack = f"stack({len(cols)}," + ','.join(f"'{column}_{flt}', {column}_{flt}" for column, flt in func_params) + ')'

(df
 .groupby('ID')
 .agg(*cols)
 .selectExpr('ID', stack)
 .withColumnRenamed('col0', 'param')
 .withColumnRenamed('col1', 'Value')
 .show()
)
+---+-------------+-----+
| ID| param|Value|
+---+-------------+-----+
| 1| Day_Monday| 5|
| 1|Month_January| 2|
| 2| Day_Monday| 10|
| 2|Month_January| null|
+---+-------------+-----+
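For comparison, here is a minimal sketch of that naive union approach (my illustration, not part of the original answer), reusing df and func_params from above and redefining the question's filter_gby_reduce so the snippet is self-contained:

from functools import reduce
import pyspark.sql.functions as f

def filter_gby_reduce(df, filter_col=None, filter_value=None):
    # Filter on one column, group by ID, take the max Value (as in the question)
    return df.filter(f.col(filter_col) == filter_value).groupby('ID').agg(f.max('Value'))

# One filtered-and-aggregated dataframe per parameter pair, tagged with the parameter it came from
results = [filter_gby_reduce(df, filter_col=column, filter_value=flt)
               .withColumn('param', f.lit(f'{column}_{flt}'))
           for column, flt in func_params]

# Union all results into a single dataframe
reduce(lambda a, b: a.unionByName(b), results).show()

Unlike the stack version, this simply has no row for a parameter combination that filters out all data (e.g. ID 2 / Month_January), and each union adds another branch to the query plan, which is why benchmarking both on the real dataset is worthwhile.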

Merging two or more dataframes/rdd efficiently in PySpark

I'm trying to merge three RDDs based on the same key. The following is the data.
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 15|
| 3| Candy| 15|
| 1| Bahroze| 15|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 7342|
| 3| Candy| 5669|
| 1| Bahroze| 8361|
+------+---------+-----+
+------+---------+-----+
|UserID|UserLabel|Total|
+------+---------+-----+
| 2| Panda| 37|
| 3| Candy| 27|
| 1| Bahroze| 39|
+------+---------+-----+
I'm able to merge these three DFs. I converted them to RDDs of dicts with the following code for all three:
new_rdd = userTotalVisits.rdd.map(lambda row: row.asDict(True))
After the RDD conversion, I take one RDD and collect the other two as lists. I map over the first RDD and then add the other keys to it based on the same UserID. I was hoping there was a better way of doing this using PySpark. Here's the code I've written.
def transform(row):
    # Add a new key to each row
    for x in conversion_list:  # first rdd as a list of dicts after using collect()
        if x['UserID'] == row['UserID']:
            row["Total"] = {"Visitors": row["Total"], "Conversions": x["Total"]}
    for y in Revenue_list:  # second rdd as a list of dicts after using collect()
        if y['UserID'] == row['UserID']:
            row["Total"]["Revenue"] = y["Total"]
    return row

potato = new_rdd.map(lambda row: transform(row))  # first rdd
How should I efficiently merge these three RDDs/DFs (I had to perform three different tasks on a huge DF)? I'm looking for a more efficient approach. PS: I'm still a Spark newbie. The result of my code is as follows, which is what I need:
{'UserID': '2', 'UserLabel': 'Panda', 'Total': {'Visitors': 37, 'Conversions': 15, 'Revenue': 7342}}
{'UserID': '3', 'UserLabel': 'Candy', 'Total': {'Visitors': 27, 'Conversions': 15, 'Revenue': 5669}}
{'UserID': '1', 'UserLabel': 'Bahroze', 'Total': {'Visitors': 39, 'Conversions': 15, 'Revenue': 8361}}
Thank you.
You can join the 3 dataframes on the columns ["UserID", "UserLabel"] and create a new struct Total from the 3 Total columns:
from pyspark.sql import functions as F

result = df1.alias("conv") \
    .join(df2.alias("rev"), ["UserID", "UserLabel"], "left") \
    .join(df3.alias("visit"), ["UserID", "UserLabel"], "left") \
    .select(
        F.col("UserID"),
        F.col("UserLabel"),
        F.struct(
            F.col("conv.Total").alias("Conversions"),
            F.col("rev.Total").alias("Revenue"),
            F.col("visit.Total").alias("Visitors")
        ).alias("Total")
    )

# write into json file
result.write.json("output")

# print result:
for i in result.toJSON().collect():
    print(i)
# {"UserID":3,"UserLabel":"Candy","Total":{"Conversions":15,"Revenue":5669,"Visitors":27}}
# {"UserID":1,"UserLabel":"Bahroze","Total":{"Conversions":15,"Revenue":8361,"Visitors":39}}
# {"UserID":2,"UserLabel":"Panda","Total":{"Conversions":15,"Revenue":7342,"Visitors":37}}
You can just do left joins on all three dataframes, but make sure the first dataframe you use has all the UserID and UserLabel values. You can ignore the GroupBy operation suggested by @blackbishop and it will still give you the required output.
I am showing how it can be done in Scala, but you could do something similar in Python (a PySpark sketch follows the Scala code below).
//source data
val visitorDF = Seq((2, "Panda", 15), (3, "Candy", 15), (1, "Bahroze", 15), (4, "Test", 25)).toDF("UserID", "UserLabel", "Total")
val conversionsDF = Seq((2, "Panda", 37), (3, "Candy", 27), (1, "Bahroze", 39)).toDF("UserID", "UserLabel", "Total")
val revenueDF = Seq((2, "Panda", 7342), (3, "Candy", 5669), (1, "Bahroze", 8361)).toDF("UserID", "UserLabel", "Total")

import org.apache.spark.sql.functions._

val finalDF = visitorDF.as("v")
  .join(conversionsDF.as("c"), Seq("UserID", "UserLabel"), "left")
  .join(revenueDF.as("r"), Seq("UserID", "UserLabel"), "left")
  .withColumn("TotalArray", struct($"v.Total".as("Visitor"), $"c.Total".as("Conversions"), $"r.Total".as("Revenue")))
  .drop("Total")

display(finalDF)
The output (shown as a screenshot in the original answer) has one row per UserID/UserLabel with a TotalArray struct containing Visitor, Conversions and Revenue.
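For reference, a rough PySpark equivalent of the Scala code above (my sketch, not from the original answer; it assumes an active SparkSession named spark and uses a select instead of drop to avoid the ambiguous Total columns):

from pyspark.sql import functions as F

# Same source data as the Scala example
visitor_df = spark.createDataFrame([(2, "Panda", 15), (3, "Candy", 15), (1, "Bahroze", 15), (4, "Test", 25)],
                                   ["UserID", "UserLabel", "Total"])
conversions_df = spark.createDataFrame([(2, "Panda", 37), (3, "Candy", 27), (1, "Bahroze", 39)],
                                       ["UserID", "UserLabel", "Total"])
revenue_df = spark.createDataFrame([(2, "Panda", 7342), (3, "Candy", 5669), (1, "Bahroze", 8361)],
                                   ["UserID", "UserLabel", "Total"])

# Left-join on the shared key columns, then build the nested struct
final_df = (visitor_df.alias("v")
            .join(conversions_df.alias("c"), ["UserID", "UserLabel"], "left")
            .join(revenue_df.alias("r"), ["UserID", "UserLabel"], "left")
            .select("UserID", "UserLabel",
                    F.struct(F.col("v.Total").alias("Visitor"),
                             F.col("c.Total").alias("Conversions"),
                             F.col("r.Total").alias("Revenue")).alias("TotalArray")))
final_df.show(truncate=False)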

Array Intersection in Spark SQL

I have a table with an array-type column named writer which has values like array[value1, value2], array[value2, value3], etc.
I am doing a self join to get results which have common values between the arrays. I tried:
sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECTION(R1.writer, R2.writer)[0] is not null ")
And
sqlContext.sql("SELECT R2.writer FROM table R1 JOIN table R2 ON R1.id != R2.id WHERE ARRAY_INTERSECT(R1.writer, R2.writer)[0] is not null ")
But got same exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Undefined function: 'ARRAY_INTERSECT'. This function is neither a
registered temporary function nor a permanent function registered in
the database 'default'.; line 1 pos 80
It seems Spark SQL does not support ARRAY_INTERSECTION or ARRAY_INTERSECT. How can I achieve my goal in Spark SQL?
Since Spark 2.4, the array_intersect function can be used directly in SQL:
spark.sql(
"SELECT array_intersect(array(1, 42), array(42, 3)) AS intersection"
).show()
+------------+
|intersection|
+------------+
| [42]|
+------------+
and in the Dataset API:
import org.apache.spark.sql.functions.array_intersect
Seq((Seq(1, 42), Seq(42, 3)))
.toDF("a", "b")
.select(array_intersect($"a", $"b") as "intersection")
.show()
+------------+
|intersection|
+------------+
| [42]|
+------------+
Equivalent functions are also present in other languages:
pyspark.sql.functions.array_intersect in PySpark.
SparkR::array_intersect in SparkR.
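For example, a minimal PySpark sketch (my illustration; it assumes Spark 2.4+ and an active SparkSession named spark):

from pyspark.sql.functions import array_intersect, col

# Two array columns, intersected element-wise per row
df = spark.createDataFrame([([1, 42], [42, 3])], ["a", "b"])
df.select(array_intersect(col("a"), col("b")).alias("intersection")).show()
# +------------+
# |intersection|
# +------------+
# |        [42]|
# +------------+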
Before Spark 2.4, you'll need a UDF:
import org.apache.spark.sql.functions.udf
spark.udf.register("array_intersect",
(xs: Seq[String], ys: Seq[String]) => xs.intersect(ys))
and then check if intersection is empty:
scala> spark.sql("SELECT size(array_intersect(array('1', '2'), array('3', '4'))) = 0").show
+-----------------------------------------+
|(size(UDF(array(1, 2), array(3, 4))) = 0)|
+-----------------------------------------+
| true|
+-----------------------------------------+
scala> spark.sql("SELECT size(array_intersect(array('1', '2'), array('1', '4'))) = 0").show
+-----------------------------------------+
|(size(UDF(array(1, 2), array(1, 4))) = 0)|
+-----------------------------------------+
| false|
+-----------------------------------------+

Aggregate First Grouped Item from Subsequent Items

I have user game sessions containing: user id, game id, score and a timestamp when the game was played.
from pyspark import SparkContext
from pyspark.sql import HiveContext
from pyspark.sql import functions as F
sc = SparkContext("local")
sqlContext = HiveContext(sc)
df = sqlContext.createDataFrame([
("u1", "g1", 10, 0),
("u1", "g3", 2, 2),
("u1", "g3", 5, 3),
("u1", "g4", 5, 4),
("u2", "g2", 1, 1),
], ["UserID", "GameID", "Score", "Time"])
Desired Output
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
I want to transform the data such that I get the max score of the first game the user played as well as the max score of the second game (bonus if I can also get the max score of all subsequent games). Unfortunately I'm not sure how that's possible to do with Spark SQL.
I know I can group by UserID, GameID and then agg to get the max score and min time. Not sure to how to proceed from there.
Clarification: note that MaxScoreGame1 and MaxScoreGame2 refer to the first and second games the user played, not the GameID.
You could try using a combination of Window functions and Pivot.
Get the row number for every game partitioned by UserID ordered by Time.
Filter down to GameNumber being 1 or 2.
Pivot on that to get your desired output shape.
Unfortunately I am using Scala, not Python, but the code below should be fairly easy to translate to the Python API.
// Imports needed for col, row_number, lit and sum
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Use a window function to get the row number per user, ordered by time
val rowNumberWindow = Window.partitionBy(col("UserId")).orderBy(col("Time"))

val output = {
  df
    .select(
      col("*"),
      row_number().over(rowNumberWindow).alias("GameNumber")
    )
    .filter(col("GameNumber") <= lit(2))
    .groupBy(col("UserId"))
    .pivot("GameNumber")
    .agg(
      sum(col("Score"))
    )
}

output.show()
+------+---+----+
|UserId| 1| 2|
+------+---+----+
| u1| 10| 2|
| u2| 1|null|
+------+---+----+
Solution with PySpark:
from pyspark.sql import Window

rowNumberWindow = Window.partitionBy("UserID").orderBy(F.col("Time"))

(df
 .groupBy("UserID", "GameID")
 .agg(F.max("Score").alias("Score"),
      F.min("Time").alias("Time"))
 .select(F.col("*"),
         F.row_number().over(rowNumberWindow).alias("GameNumber"))
 .filter(F.col("GameNumber") <= F.lit(2))
 .withColumn("GameMaxScoreCol", F.concat(F.lit("MaxScoreGame"), F.col("GameNumber")))
 .groupBy("UserID")
 .pivot("GameMaxScoreCol")
 .agg(F.max("Score"))
).show()
+------+-------------+-------------+
|UserID|MaxScoreGame1|MaxScoreGame2|
+------+-------------+-------------+
| u1| 10| 5|
| u2| 1| null|
+------+-------------+-------------+
