Technique for joining with spark dataframe w/ custom partitioner works w/ python, but not scala? - apache-spark

I recently read an article that described how to custom partition a dataframe
[ https://dataninjago.com/2019/06/01/create-custom-partitioner-for-spark-dataframe/ ] in which the author illustrated the technique in Python. I use Scala, and the technique looked like a good way to address issues of skew, so I tried something similar, and what I found was that when one does the following:
- create 2 data frames, D1, D2
- convert D1, D2 to 2 Pair RDDs R1,R2
(where the key is the key you want to join on)
- repartition R1,R2 with a custom partitioner 'C'
where 'C' has 2 partitions (p-0,p-1) and
stuffs everything in P-1, except keys == 'a'
- join R1,R2 as R3
- OBSERVE that:
- partitioner for R3 is 'C' (same for R1,R2)
- when printing the contents of each partition of R3 all entries
except the one keyed by 'a' is in p-1
- set D1' <- R1.toDF
- set D2' <- R2.toDF
We note the following results:
0) The join of D1' and D2' produce expected results (good)
1) The partitioners for D1' and D2' are None -- not Some(C),
as was the case with RDD's R1/R2 (bad)
2) The contents of the glom'd underlying RDDs of D1' and D2' did
not have everything (except key 'a') piled up
in partition 1 as expected.(bad)
So, I came away with the following conclusion... which will work for me practically... But it really irks me that I could not get the behavior in the article which used Python:
When one needs to use custom partitioning with Dataframes in Scala one must
drop into RDD's do the join or whatever operation on the RDD, then convert back
to dataframe. You can't apply the custom partitioner, then convert back to
dataframe, do your operations, and expect the custom partitioning to work.
Now...I am hoping I am wrong ! Perhaps someone with more expertise in Spark internals can guide me here. I have written a little program (below) to illustrate the results. Thanks in advance if you can set me straight.
UPDATE
In addition to the Spark code which illustrates the problem I also tried a simplified version of what the original article presented in Python. The conversions below create a dataframe, extract its underlying RDD and repartition it, then recover the dataframe and verify that the partitioner is lost.
Python snippet illustrating problem
from pyspark.sql.types import IntegerType
mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())
def travelGroupPartitioner(key):
return 0
dfRDD = df.rdd.map(lambda x: (x[0],x))
dfRDD2 = dfRDD .partitionBy(8, travelGroupPartitioner)
# this line uses approach of original article and maps to only the value
# but map doesn't guarantee preserving pratitioner, so i tried without the
# map below...
df2 = spark.createDataFrame(dfRDD2 .map(lambda x: x[1]))
print ( df2.rdd.partitioner ) # prints None
# create dataframe from partitioned RDD _without_ the map,
# and we _still_ lose partitioner
df3 = spark.createDataFrame(dfRDD2)
print ( df3.rdd.partitioner ) # prints None
Scala snippet illustrating problem
object Question extends App {
val conf =
new SparkConf().setAppName("blah").
setMaster("local").set("spark.sql.shuffle.partitions", "2")
val sparkSession = SparkSession.builder .config(conf) .getOrCreate()
val spark = sparkSession
import spark.implicits._
sparkSession.sparkContext.setLogLevel("ERROR")
class CustomPartitioner(num: Int) extends Partitioner {
def numPartitions: Int = num
def getPartition(key: Any): Int = if (key.toString == "a") 0 else 1
}
case class Emp(name: String, deptId: String)
case class Dept(deptId: String, name: String)
val value: RDD[Emp] = spark.sparkContext.parallelize(
Seq(
Emp("anne", "a"),
Emp("dave", "d"),
Emp("claire", "c"),
Emp("roy", "r"),
Emp("bob", "b"),
Emp("zelda", "z"),
Emp("moe", "m")
)
)
val employee: Dataset[Emp] = value.toDS()
val department: Dataset[Dept] = spark.sparkContext.parallelize(
Seq(
Dept("a", "ant dept"),
Dept("d", "duck dept"),
Dept("c", "cat dept"),
Dept("r", "rabbit dept"),
Dept("b", "badger dept"),
Dept("z", "zebra dept"),
Dept("m", "mouse dept")
)
).toDS()
val dumbPartitioner: Partitioner = new CustomPartitioner(2)
// Convert to-be-joined dataframes to custom repartition RDDs [ custom partitioner: cp ]
//
val deptPairRdd: RDD[(String, Dept)] = department.rdd.map { dept => (dept.deptId, dept) }
val empPairRdd: RDD[(String, Emp)] = employee.rdd.map { emp: Emp => (emp.deptId, emp) }
val cpEmpRdd: RDD[(String, Emp)] = empPairRdd.partitionBy(dumbPartitioner)
val cpDeptRdd: RDD[(String, Dept)] = deptPairRdd.partitionBy(dumbPartitioner)
assert(cpEmpRdd.partitioner.get == dumbPartitioner)
assert(cpDeptRdd.partitioner.get == dumbPartitioner)
// Here we join using RDDs and ensure that the resultant rdd is partitioned so most things end up in partition 1
val joined: RDD[(String, (Emp, Dept))] = cpEmpRdd.join(cpDeptRdd)
val reso: Array[(Array[(String, (Emp, Dept))], Int)] = joined.glom().collect().zipWithIndex
reso.foreach((item: Tuple2[Array[(String, (Emp, Dept))], Int]) => println(s"array size: ${item._2}. contents: ${item._1.toList}"))
System.out.println("partitioner of RDD created by joining 2 RDD's w/ custom partitioner: " + joined.partitioner)
assert(joined.partitioner.contains(dumbPartitioner))
val recoveredDeptDF: DataFrame = deptPairRdd.toDF
val recoveredEmpDF: DataFrame = empPairRdd.toDF
System.out.println(
"partitioner for DF recovered from custom partitioned RDD (not as expected!):" +
recoveredDeptDF.rdd.partitioner)
val joinedDf = recoveredEmpDF.join(recoveredDeptDF, "_1")
println("printing results of joining the 2 dataframes we 'recovered' from the custom partitioned RDDS (looks good)")
joinedDf.show()
println("PRINTING partitions of joined DF does not match the glom'd results we got from underlying RDDs")
joinedDf.rdd.glom().collect().
zipWithIndex.foreach {
item: Tuple2[Any, Int] =>
val asList = item._1.asInstanceOf[Array[org.apache.spark.sql.Row]].toList
println(s"array size: ${item._2}. contents: $asList")
}
assert(joinedDf.rdd.partitioner.contains(dumbPartitioner)) // this will fail ;^(
}

Check out my new library which adds partitionBy method to the Dataset/Dataframe API level.
Taking your Emp and Dept objects as example:
class DeptByIdPartitioner extends TypedPartitioner[Dept] {
override def getPartitionIdx(value: Dept): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
class EmpByDepIdPartitioner extends TypedPartitioner[Emp] {
override def getPartitionIdx(value: Emp): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
Note that we are extending TypedPartitioner.
It is compile-time safe, you won't be able to repartition a dataset of persons with emp partitioner.
val spark = SparkBuilder.getSpark()
import org.apache.spark.sql.exchange.implicits._ //<-- addtitonal import
import spark.implicits._
val deptPartitioned = department.repartitionBy(new DeptByIdPartitioner)
val empPartitioned = employee.repartitionBy(new EmpByDepIdPartitioner)
Let's check how our data is partitioned:
Dep dataset:
Partition N 0
: List([a,ant dept])
Partition N 1
: List([d,duck dept], [c,cat dept], [r,rabbit dept], [b,badger dept], [z,zebra dept], [m,mouse dept])
If we join repartitioned by the same key dataset Catalyst will properly recognize this:
val joined = deptPartitioned.join(empPartitioned, "deptId")
println("Joined:")
val result: Array[(Int, Array[Row])] = joined.rdd.glom().collect().zipWithIndex.map(_.swap)
for (elem <- result) {
println(s"Partition N ${elem._1}")
println(s"\t: ${elem._2.toList}")
}
Partition N 0
: List([a,ant dept,anne])
Partition N 1
: List([b,badger dept,bob], [c,cat dept,claire], [d,duck dept,dave], [m,mouse dept,moe], [r,rabbit dept,roy], [z,zebra dept,zelda])

What version of Spark are you using? If it's 2.x and above, it's recommended to use Dataframe/Dataset API instead, not RDDs
It's much easier to work with the mentioned API than with RDDs, and it performs much better on later versions of Spark
You may find the link below useful for how to join DFs:
How to join two dataframes in Scala and select on few columns from the dataframes by their index?
Once you get your joined DataFrame, you can use the link below for partitioning by column values, which I assume you're trying to achieve:
Partition a spark dataframe based on column value?

Related

Higher Order functions in Spark SQL

Can anyone please explain the transform() and filter() in Spark Sql 2.4 with some advanced real-world use-case examples ?
In a sql query, is this only to be used with array columns or it can also be applied to any column type in general. It would be great if anyone could demonstrate with a sql query for an advanced application.
Thanks in advance.
Not going down the .filter road as I cannot see the focus there.
For .transform
dataframe transform at DF-level
transform on an array of a DF in v 2.4
transform on an array of a DF in v 3
The following:
dataframe transform
From the official docs https://kb.databricks.com/data/chained-transformations.html transform on DF can end up like spaghetti. Opinion can differ here.
This they say is messy:
...
def inc(i: Int) = i + 1
val tmp0 = func0(inc, 3)(testDf)
val tmp1 = func1(1)(tmp0)
val tmp2 = func2(2)(tmp1)
val res = tmp2.withColumn("col3", expr("col2 + 3"))
compared to:
val res = testDf.transform(func0(inc, 4))
.transform(func1(1))
.transform(func2(2))
.withColumn("col3", expr("col2 + 3"))
transform with lambda function on an array of a DF in v 2.4 which needs the select and expr combination
import org.apache.spark.sql.functions._
val df = Seq(Seq(Array(1,999),Array(2,9999)),
Seq(Array(10,888),Array(20,8888))).toDF("c1")
val df2 = df.select(expr("transform(c1, x -> x[1])").as("last_vals"))
transform with lambda function new array function on a DF in v 3 using withColumn
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df = Seq(
(Array("New York", "Seattle")),
(Array("Barcelona", "Bangalore"))
).toDF("cities")
val df2 = df.withColumn("fun_cities", transform(col("cities"),
(col: Column) => concat(col, lit(" is fun!"))))
Try them.
Final note and excellent point raised (from https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/):
transform works similar to the map function in Scala. I’m not sure why
they chose to name this function transform… I think array_map would
have been a better name, especially because the Dataset#transform
function is commonly used to chain DataFrame transformations.
Update
If wanting to use %sql or display approach for Higher Order Functions, then consult this: https://docs.databricks.com/delta/data-transformation/higher-order-lambda-functions.html

Join apache spark dataframes properly with scala avoiding null values

Hellow everyone!
I have two DataFrames in apache spark (2.3) and I want to join them properly. I will explain below what I mean with 'properly'. First of all the two dataframes holds the following information:
nodeDf: ( id, year, title, authors, journal, abstract )
edgeDf: ( srcId, dstId, label )
The label could be 0 or 1 in case node1 is connected with node2 or not.
I want to combine this two dataframes to get one dataframe withe the following information:
JoinedDF: ( id_from, year_from, title_from, journal_from, abstract_from, id_to, year_to, title_to, journal_to, abstract_to, time_dist )
time_dist = abs(year_from - year_to)
When I said 'properly' I meant that the query must be as fast as it could be and I don't want to contain null rows or cels ( value on a row ).
I have tried the following but I took me 500 -540 sec to execute the query and the final dataframe contains null values. I don't even know if the dataframes ware joined correctly.
I want to mention that the node file from which I create the nodeDF has 27770 rows and the edge file (edgeDf) has 615512 rows.
Code:
val spark = SparkSession.builder().master("local[*]").appName("Logistic Regression").getOrCreate()
val sc = spark.sparkContext
val data = sc.textFile("resources/data/training_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1), fields(2).toInt)
})
val data2 = sc.textFile("resources/data/test_set.txt").map(line =>{
val fields = line.split(" ")
(fields(0),fields(1))
})
import spark.implicits._
val trainingDF = data.toDF("srcId","dstId", "label")
val testDF = data2.toDF("srcId","dstId")
val infoRDD = spark.read.option("header","false").option("inferSchema","true").format("csv").load("resources/data/node_information.csv")
val infoDF = infoRDD.toDF("srcId","year","title","authors","jurnal","abstract")
println("Showing linksDF sample...")
trainingDF.show(5)
println("Rows of linksDF: ",trainingDF.count())
println("Showing infoDF sample...")
infoDF.show(2)
println("Rows of infoDF: ",infoDF.count())
println("Joining linksDF and infoDF...")
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" === $"b.srcId")
println(joinedDF.count())
joinedDF = joinedDF.select($"a.srcId",$"a.dstId",$"a.label",$"b.year",$"b.title",$"b.authors",$"b.jurnal",$"b.abstract")
joinedDF.show(5)
val graphX = new GraphX()
val pageRankDf =graphX.computePageRank(spark,"resources/data/training_set.txt",0.0001)
println("Joining joinedDF and pageRankDf...")
joinedDF = joinedDF.as("a").join(pageRankDf.as("b"),$"a.srcId" === $"b.nodeId")
var dfWithRanks = joinedDF.select("srcId","dstId","label","year","title","authors","jurnal","abstract","rank").withColumnRenamed("rank","pgRank")
dfWithRanks.show(5)
println("Renameming joinedDF...")
dfWithRanks = dfWithRanks
.withColumnRenamed("srcId","id_from")
.withColumnRenamed("dstId","id_to")
.withColumnRenamed("year","year_from")
.withColumnRenamed("title","title_from")
.withColumnRenamed("authors","authors_from")
.withColumnRenamed("jurnal","jurnal_from")
.withColumnRenamed("abstract","abstract_from")
var infoDfRenamed = dfWithRanks
.withColumnRenamed("id_from","id_from")
.withColumnRenamed("id_to","id_to")
.withColumnRenamed("year_from","year_to")
.withColumnRenamed("title_from","title_to")
.withColumnRenamed("authors_from","authors_to")
.withColumnRenamed("jurnal_from","jurnal_to")
.withColumnRenamed("abstract_from","abstract_to").select("id_to","year_to","title_to","authors_to","jurnal_to","jurnal_to")
var finalDF = dfWithRanks.as("a").join(infoDF.as("b"),$"a.id_to" === $"b.srcId")
finalDF = finalDF
.withColumnRenamed("year","year_to")
.withColumnRenamed("title","title_to")
.withColumnRenamed("authors","authors_to")
.withColumnRenamed("jurnal","jurnal_to")
.withColumnRenamed("abstract","abstract_to")
println("Dropping unused columns from joinedDF...")
finalDF = finalDF.drop("srcId")
finalDF.show(5)
Here are my results!
Avoid all calculations and code related to pgRank! Is there any proper way to do this join works?
You can filter your data first and then join, in that case you will avoid nulls
df.filter($"ColumnName".isNotNull)
use <=> operator in your joining column condition
var joinedDF = trainingDF.as("a").join(infoDF.as("b"),$"a.srcId" <=> $"b.srcId")
There is a function in spark 2.1 or greater is eqNullSafe
var joinedDF = trainingDF.join(infoDF,trainingDF("srcId").eqNullSafe(infoDF("srcId")))

Perform join in spark only on one co-ordinate of pair key?

I have 3 RDDs:
1st one is of form ((a,b),c).
2nd one is of form (b,d).
3rd one is of form (a,e).
How can I perform join in scala over these RDDs such that my final output is of the form ((a,b),c,d,e)?
you can do something like this:
val rdd1: RDD[((A,B),C)]
val rdd2: RDD[(B,D)]
val rdd3: RDD[(A,E)]
val tmp1 = rdd1.map {case((a,b),c) => (a, (b,c))}
val tmp2 = tmp1.join(rdd3).map{case(a, ((b,c), e)) => (b, (a,c,e))}
val res = tmp2.join(rdd2).map{case(b, ((a,c,e), d)) => ((a,b), c,d,e)}
With current implementations of join apis for paired rdds, its not possible to use condtions. And you would need conditions when joining to get the desired result.
But you can use dataframes/datasets for the joins, where you can use conditions. So use dataframes/datasets for the joins. If you want the result of join in dataframes then you can proceed with that. In case you want your results in rdds, then *.rdd can be used to convert the dataframes/datasets to RDD[Row]*
Below is the sample codes of it can be done in scala
//creating three rdds
val first = sc.parallelize(Seq((("a", "b"), "c")))
val second = sc.parallelize(Seq(("b", "d")))
val third = sc.parallelize(Seq(("a", "e")))
//coverting rdds to dataframes
val firstdf = first.toDF("key1", "value1")
val seconddf = second.toDF("key2", "value2")
val thirddf = third.toDF("key3", "value3")
//udf function for the join condition
import org.apache.spark.sql.functions._
def joinCondition = udf((strct: Row, key: String) => strct.toSeq.contains(key))
//joins with conditions
firstdf
.join(seconddf, joinCondition(firstdf("key1"), seconddf("key2"))) //joining first with second
.join(thirddf, joinCondition(firstdf("key1"), thirddf("key3"))) //joining first with third
.drop("key2", "key3") //dropping unnecessary columns
.rdd //converting dataframe to rdd
You should have output as
[[a,b],c,d,e]

Manipulating a dataframe within a Spark UDF

I have a UDF that filters and selects values from a dataframe, but it runs into "object not serializable" error. Details below.
Suppose I have a dataframe df1 that has columns with names ("ID", "Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10"). I want sum a subset of the "Y" columns based on the matching "ID" and "Value" from another dataframe df2. I tried the following:
val y_list = ("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8", "Y9", "Y10").map(c => col(c))
def udf_test(ID: String, value: Int): Double = {
df1.filter($"ID" === ID).select(y_list:_*).first.toSeq.toList.take(value).foldLeft(0.0)(_+_)
}
sqlContext.udf.register("udf_test", udf_test _)
val df_result = df2.withColumn("Result", callUDF("udf_test", $"ID", $"Value"))
This gives me errors of the form:
java.io.NotSerializableException: org.apache.spark.sql.Column
Serialization stack:
- object not serializable (class: org.apache.spark.sql.Column, value: Y1)
I looked this up and realized that Spark Column is not serializable. I am wondering:
1) There is any way to manipulate a dataframe within an UDF?
2) If not, what's the best way to achieve the type of operation above? My real case is more complicated than this. It requires me to select values from multiple small dataframes based on some columns in a big dataframe, and compute back a value to the big dataframe.
I am using Spark 1.6.3. Thanks!
You can't use Dataset operations inside UDFs. UDF can only manupulate on existing columns and produce one result column. It can't filter Dataset or make aggregations, but it can be used inside filter. UDAF also can aggregate values.
Instead, you can use .as[SomeCaseClass] to make Dataset from DataFrame and use normal, strongly typed functions inside filter, map, reduce.
Edit: If you want to join your bigDF with every small DF in smallDFs List, you can do:
import org.apache.spark.sql.functions._
val bigDF = // some processing
val smallDFs = Seq(someSmallDF1, someSmallDF2)
val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))
broadcast is a function to add Broadcast Hint to small DF, so that small DF will use more efficient Broadcast Join instead of Sort Merge Join
1) No, you can only use plain scala code within UDFs
2) If you interpreted your code correctly, you can achieve your goal with:
df2
.join(
df1.select($"ID",y_list.foldLeft(lit(0))(_ + _).as("Result")),Seq("ID")
)
import org.apache.spark.sql.functions._
val events = Seq (
(1,1,2,3,4),
(2,1,2,3,4),
(3,1,2,3,4),
(4,1,2,3,4),
(5,1,2,3,4)).toDF("ID","amt1","amt2","amt3","amt4")
var prev_amt5=0
var i=1
def getamt5value(ID:Int,amt1:Int,amt2:Int,amt3:Int,amt4:Int) : Int = {
if(i==1){
i=i+1
prev_amt5=0
}else{
i=i+1
}
if (ID == 0)
{
if(amt1==0)
{
val cur_amt5= 1
prev_amt5=cur_amt5
cur_amt5
}else{
val cur_amt5=1*(amt2+amt3)
prev_amt5=cur_amt5
cur_amt5
}
}else if (amt4==0 || (prev_amt5==0 & amt1==0)){
val cur_amt5=0
prev_amt5=cur_amt5
cur_amt5
}else{
val cur_amt5=prev_amt5 + amt2 + amt3 + amt4
prev_amt5=cur_amt5
cur_amt5
}
}
val getamt5 = udf {(ID:Int,amt1:Int,amt2:Int,amt3:Int,amt4:Int) =>
getamt5value(ID,amt1,amt2,amt3,amt4)
}
myDF.withColumn("amnt5", getamt5(myDF.col("ID"),myDF.col("amt1"),myDF.col("amt2"),myDF.col("amt3"),myDF.col("amt4"))).show()

Spark how to use a UDF with a Join

I'd like to use a specific UDF with using Spark
Here's the plan:
I have a table A(10 million rows) and a table B(15 millions rows)
I'd like to use an UDF comparing one element of the table A and one of the table B
Is it possible
Here's a a sample of my code. At some point i also need to say that my UDF compare must be greater than 0,9:
DataFrame dfr = df
.select("name", "firstname", "adress1", "city1","compare(adress1,adress2)")
.join(dfa,df.col("adress1").equalTo(dfa.col("adress2"))
.and((df.col("city1").equalTo(dfa.col("city2"))
...;
Is it possible ?
Yes, you can. However it will be slower than normal operators, as Spark will be not able to do predicate pushdown
Example:
val udf = udf((x : String, y : String) => { here compute similarity; });
val df3 = df1.join(df2, udf(df1.field1, df2.field1) > 0.9)
For example:
val df1 = Seq (1, 2, 3, 4).toDF("x")
val df2 = Seq(1, 3, 7, 11).toDF("q")
val udf = org.apache.spark.sql.functions.udf((x : Int, q : Int) => { Math.abs(x - q); });
val df3 = df1.join(df2, udf(df1("x"), df2("q")) > 1)
You can also directly return boolean from User Defined Function

Resources