Unpersist cached dataframe in one single job - apache-spark

In my Spark job (Spark v2.4.7), I chain many transformations like this:
InputDf.transform(A)
.transform(B)
.transform(C)
...
.transform(Z)
.write.format("xxx")
.mode(saveMode)
.save(Path)
In my transformation C, I have something like this:
def C(): DataFrame => DataFrame = inputDf => {
val cachedDf= inputDf.cache()
val Df1= cachedDf.transform(...)
val Df2 = ...
cachedDf.join(Df2)
.unionByName(Df1)
}
As you can see, I reuse my inputDf twice, which is why I cache it.
But how can I unpersist cachedDf after using it?

You can use unpersist; it works the same way in Scala.
You may use blocking = true if you want to force other computations to wait until this df is removed from the cache.
With the default value (blocking = false) the df is only marked for removal from the cache, and you cannot be sure when the removal will actually happen.
In your case using this function may be tricky. It looks like you have many transformations and a single action at the end. Transformations in Spark are lazy, and so is cache, whereas unpersist is not lazy: it unregisters the cached data as soon as it is called. If you mix persist/unpersist into one transformation chain, the cache may therefore be dropped before the action ever uses it.
If something else happens after this write, you can unpersist your df after the first action, and it will be removed from the cache before the next action runs.
I did something similar to your case:
import org.apache.spark.sql.functions._
val data = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 7), ("test55", 86))
val data2 = Seq(("test", 3),("test", 3), ("test2", 5), ("test3", 6), ("test33", 76))
val df1 = data.toDF("Name", "Value")
val dfCached = df1.cache
val df2 = data2.toDF("Name", "Value")
val dfTransformed = dfCached.filter(col("Name") === "test3")
val result = dfCached.join(df2, Seq("Name", "Value")).unionByName(dfTransformed)
dfCached.unpersist(false)
result.show
When I mix persist/unpersist into one transformation chain, as above, it looks like the cache is not used during execution.
With unpersist moved after the action (show in my example), the df was cached and the cache was used during computation: the plan now contains "InMemoryTableScan" blocks, which means the data is read from the cache.
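To make the ordering concrete, here is a minimal sketch of the same example with unpersist moved after the action. It assumes a local SparkSession created just for the demonstration (in spark-shell, `spark` already exists); the sample rows and the names `rows` and `levelAfter` are illustrative and not taken from the original job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.storage.StorageLevel

// Assumption: a local session for demonstration purposes only.
val spark = SparkSession.builder().master("local[*]").appName("unpersist-after-action").getOrCreate()
import spark.implicits._

val dfCached = Seq(("test", 3), ("test2", 5), ("test3", 7)).toDF("Name", "Value").cache()
val df2 = Seq(("test", 3), ("test3", 6)).toDF("Name", "Value")

// dfCached is referenced twice, which is why caching it can pay off.
val dfTransformed = dfCached.filter(col("Name") === "test3")
val result = dfCached.join(df2, Seq("Name", "Value")).unionByName(dfTransformed)

// Run the action first, while dfCached is still registered as cached.
// Calling result.explain() at this point should list InMemoryTableScan nodes.
val rows = result.collect()

// Release the cache only after the action has finished.
// blocking = true waits until the cached blocks are actually removed.
dfCached.unpersist(blocking = true)
val levelAfter = dfCached.storageLevel
```

For the chained job in the question, this suggests keeping a reference to the cached DataFrame outside transformation C and calling unpersist on it only after save(...) has returned.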

Related

Any performance gain by persist() if only in series of transformations without reusing the dataframe

Based on my understanding, persist() from Spark can cache the data in memory to be reused for further references.
val df = spark.sql(...)
df.persist()
df.count
val df1 = df.filter(...).distinct //df is from cache
val df2 = df.groupBy(...).agg(...) //df is from cache
However, if there is a series of heavy transformations but no data frame is used more than once, does it still make sense to use persist to boost performance, given that the current job constantly gets stuck? If not, what are the available options?
// each val below is only used one time only
val df = spark.sql(...)
// df.persist ---> can it avoid job stuck?
// df.count
val df1 = df.groupBy(...).agg(...)
// df1.persist ---> can it avoid job stuck?
// df1.count
val df2 = df1.join(df_join, df1("A") <=> df_join("B") && df1("C") <=> df_join("D"), "inner")
// df2.persist ---> can it avoid job stuck?
// df2.count
val df3 = df2... blah..
// df3.persist ---> can it avoid job stuck?
// df3.count
// if persist can't keep the job from getting stuck, does calling "count"
// on each DataFrame help, compared with letting a single count on the
// last DataFrame trigger all the transformations at once?

Technique for joining with spark dataframe w/ custom partitioner works w/ python, but not scala?

I recently read an article that described how to custom partition a dataframe
[ https://dataninjago.com/2019/06/01/create-custom-partitioner-for-spark-dataframe/ ] in which the author illustrated the technique in Python. I use Scala, and the technique looked like a good way to address issues of skew, so I tried something similar, and what I found was that when one does the following:
- create 2 data frames, D1, D2
- convert D1, D2 to 2 Pair RDDs R1, R2
  (where the key is the key you want to join on)
- repartition R1, R2 with a custom partitioner 'C',
  where 'C' has 2 partitions (p-0, p-1) and
  stuffs everything in p-1, except keys == 'a'
- join R1, R2 as R3
- OBSERVE that:
  - partitioner for R3 is 'C' (same for R1, R2)
  - when printing the contents of each partition of R3, all entries
    except the one keyed by 'a' are in p-1
- set D1' <- R1.toDF
- set D2' <- R2.toDF
We note the following results:
0) The join of D1' and D2' produces the expected results (good)
1) The partitioners for D1' and D2' are None -- not Some(C),
   as was the case with RDDs R1/R2 (bad)
2) The contents of the glom'd underlying RDDs of D1' and D2' did
   not have everything (except key 'a') piled up
   in partition 1 as expected (bad)
So, I came away with the following conclusion... which will work for me practically... But it really irks me that I could not get the behavior in the article which used Python:
When one needs to use custom partitioning with DataFrames in Scala, one must
drop into RDDs, do the join or whatever operation on the RDD, then convert back
to a DataFrame. You can't apply the custom partitioner, then convert back to a
DataFrame, do your operations, and expect the custom partitioning to work.
Now...I am hoping I am wrong ! Perhaps someone with more expertise in Spark internals can guide me here. I have written a little program (below) to illustrate the results. Thanks in advance if you can set me straight.
UPDATE
In addition to the Spark code which illustrates the problem I also tried a simplified version of what the original article presented in Python. The conversions below create a dataframe, extract its underlying RDD and repartition it, then recover the dataframe and verify that the partitioner is lost.
Python snippet illustrating problem
from pyspark.sql.types import IntegerType
mylist = [1, 2, 3, 4]
df = spark.createDataFrame(mylist, IntegerType())
def travelGroupPartitioner(key):
    return 0
dfRDD = df.rdd.map(lambda x: (x[0], x))
dfRDD2 = dfRDD.partitionBy(8, travelGroupPartitioner)
# this line uses the approach of the original article and maps to only the value,
# but map doesn't guarantee preserving the partitioner, so I tried without the
# map below...
df2 = spark.createDataFrame(dfRDD2.map(lambda x: x[1]))
print(df2.rdd.partitioner)  # prints None
# create dataframe from partitioned RDD _without_ the map,
# and we _still_ lose partitioner
df3 = spark.createDataFrame(dfRDD2)
print(df3.rdd.partitioner)  # prints None
Scala snippet illustrating problem
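import org.apache.spark.{Partitioner, SparkConf}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
object Question extends App {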
val conf =
new SparkConf().setAppName("blah").
setMaster("local").set("spark.sql.shuffle.partitions", "2")
val sparkSession = SparkSession.builder .config(conf) .getOrCreate()
val spark = sparkSession
import spark.implicits._
sparkSession.sparkContext.setLogLevel("ERROR")
class CustomPartitioner(num: Int) extends Partitioner {
def numPartitions: Int = num
def getPartition(key: Any): Int = if (key.toString == "a") 0 else 1
}
case class Emp(name: String, deptId: String)
case class Dept(deptId: String, name: String)
val value: RDD[Emp] = spark.sparkContext.parallelize(
Seq(
Emp("anne", "a"),
Emp("dave", "d"),
Emp("claire", "c"),
Emp("roy", "r"),
Emp("bob", "b"),
Emp("zelda", "z"),
Emp("moe", "m")
)
)
val employee: Dataset[Emp] = value.toDS()
val department: Dataset[Dept] = spark.sparkContext.parallelize(
Seq(
Dept("a", "ant dept"),
Dept("d", "duck dept"),
Dept("c", "cat dept"),
Dept("r", "rabbit dept"),
Dept("b", "badger dept"),
Dept("z", "zebra dept"),
Dept("m", "mouse dept")
)
).toDS()
val dumbPartitioner: Partitioner = new CustomPartitioner(2)
// Convert to-be-joined dataframes to custom repartition RDDs [ custom partitioner: cp ]
//
val deptPairRdd: RDD[(String, Dept)] = department.rdd.map { dept => (dept.deptId, dept) }
val empPairRdd: RDD[(String, Emp)] = employee.rdd.map { emp: Emp => (emp.deptId, emp) }
val cpEmpRdd: RDD[(String, Emp)] = empPairRdd.partitionBy(dumbPartitioner)
val cpDeptRdd: RDD[(String, Dept)] = deptPairRdd.partitionBy(dumbPartitioner)
assert(cpEmpRdd.partitioner.get == dumbPartitioner)
assert(cpDeptRdd.partitioner.get == dumbPartitioner)
// Here we join using RDDs and ensure that the resultant rdd is partitioned so most things end up in partition 1
val joined: RDD[(String, (Emp, Dept))] = cpEmpRdd.join(cpDeptRdd)
val reso: Array[(Array[(String, (Emp, Dept))], Int)] = joined.glom().collect().zipWithIndex
reso.foreach((item: Tuple2[Array[(String, (Emp, Dept))], Int]) => println(s"array size: ${item._2}. contents: ${item._1.toList}"))
System.out.println("partitioner of RDD created by joining 2 RDD's w/ custom partitioner: " + joined.partitioner)
assert(joined.partitioner.contains(dumbPartitioner))
// Recover the dataframes from the custom partitioned RDDs (cpDeptRdd / cpEmpRdd)
val recoveredDeptDF: DataFrame = cpDeptRdd.toDF
val recoveredEmpDF: DataFrame = cpEmpRdd.toDF
System.out.println(
"partitioner for DF recovered from custom partitioned RDD (not as expected!):" +
recoveredDeptDF.rdd.partitioner)
val joinedDf = recoveredEmpDF.join(recoveredDeptDF, "_1")
println("printing results of joining the 2 dataframes we 'recovered' from the custom partitioned RDDS (looks good)")
joinedDf.show()
println("PRINTING partitions of joined DF does not match the glom'd results we got from underlying RDDs")
joinedDf.rdd.glom().collect().
zipWithIndex.foreach {
item: Tuple2[Any, Int] =>
val asList = item._1.asInstanceOf[Array[org.apache.spark.sql.Row]].toList
println(s"array size: ${item._2}. contents: $asList")
}
assert(joinedDf.rdd.partitioner.contains(dumbPartitioner)) // this will fail ;^(
}
Check out my new library, which adds a repartitionBy method at the Dataset/DataFrame API level.
Taking your Emp and Dept objects as example:
class DeptByIdPartitioner extends TypedPartitioner[Dept] {
override def getPartitionIdx(value: Dept): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
class EmpByDepIdPartitioner extends TypedPartitioner[Emp] {
override def getPartitionIdx(value: Emp): Int = if (value.deptId.startsWith("a")) 0 else 1
override def numPartitions: Int = 2
override def partitionKeys: Option[Set[PartitionKey]] = Some(Set(("deptId", StringType)))
}
Note that we are extending TypedPartitioner.
It is compile-time safe: you won't be able to repartition a dataset of persons with the Emp partitioner.
val spark = SparkBuilder.getSpark()
import org.apache.spark.sql.exchange.implicits._ //<-- additional import
import spark.implicits._
val deptPartitioned = department.repartitionBy(new DeptByIdPartitioner)
val empPartitioned = employee.repartitionBy(new EmpByDepIdPartitioner)
Let's check how our data is partitioned:
Dep dataset:
Partition N 0
: List([a,ant dept])
Partition N 1
: List([d,duck dept], [c,cat dept], [r,rabbit dept], [b,badger dept], [z,zebra dept], [m,mouse dept])
If we join datasets that were repartitioned by the same key, Catalyst will properly recognize this:
val joined = deptPartitioned.join(empPartitioned, "deptId")
println("Joined:")
val result: Array[(Int, Array[Row])] = joined.rdd.glom().collect().zipWithIndex.map(_.swap)
for (elem <- result) {
println(s"Partition N ${elem._1}")
println(s"\t: ${elem._2.toList}")
}
Partition N 0
: List([a,ant dept,anne])
Partition N 1
: List([b,badger dept,bob], [c,cat dept,claire], [d,duck dept,dave], [m,mouse dept,moe], [r,rabbit dept,roy], [z,zebra dept,zelda])
What version of Spark are you using? If it's 2.x and above, it's recommended to use Dataframe/Dataset API instead, not RDDs
It's much easier to work with the mentioned API than with RDDs, and it performs much better on later versions of Spark
You may find the link below useful for how to join DFs:
How to join two dataframes in Scala and select on few columns from the dataframes by their index?
Once you get your joined DataFrame, you can use the link below for partitioning by column values, which I assume you're trying to achieve:
Partition a spark dataframe based on column value?
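As a rough sketch of what partitioning a DataFrame by column value looks like with the built-in API, the snippet below uses `repartition` with a column expression. The sample rows, the local SparkSession, and the names `byDept` and `withPid` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, spark_partition_id}

// Assumption: a local session for demonstration purposes only.
val spark = SparkSession.builder().master("local[*]").appName("repartition-by-column").getOrCreate()
import spark.implicits._

val department = Seq(
  ("a", "ant dept"),
  ("a", "aardvark dept"),
  ("d", "duck dept"),
  ("c", "cat dept")
).toDF("deptId", "name")

// Hash-partition into 2 partitions by deptId:
// rows with the same deptId always land in the same partition.
val byDept = department.repartition(2, col("deptId"))

// Attach the partition id to each row to inspect the layout.
val withPid = byDept.withColumn("pid", spark_partition_id())
withPid.show()
```

Note that this is hash partitioning: Spark decides which partition each key goes to. It does not let you route specific keys to specific partitions the way a custom RDD Partitioner does, so it is not a drop-in replacement for the skew handling described in the question.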

Perform join in spark only on one co-ordinate of pair key?

I have 3 RDDs:
1st one is of form ((a,b),c).
2nd one is of form (b,d).
3rd one is of form (a,e).
How can I perform join in scala over these RDDs such that my final output is of the form ((a,b),c,d,e)?
you can do something like this:
val rdd1: RDD[((A,B),C)]
val rdd2: RDD[(B,D)]
val rdd3: RDD[(A,E)]
val tmp1 = rdd1.map {case((a,b),c) => (a, (b,c))}
val tmp2 = tmp1.join(rdd3).map{case(a, ((b,c), e)) => (b, (a,c,e))}
val res = tmp2.join(rdd2).map{case(b, ((a,c,e), d)) => ((a,b), c,d,e)}
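The re-keying steps above can be tried end to end with the one-row sample data used elsewhere on this page. This is a sketch under assumptions: a local SparkSession is created just for the demonstration, all values are Strings, and the intermediate names `byA`, `byB` and `out` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration purposes only.
val spark = SparkSession.builder().master("local[*]").appName("rekey-join").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq((("a", "b"), "c"))) // ((a, b), c)
val rdd2 = sc.parallelize(Seq(("b", "d")))        // (b, d)
val rdd3 = sc.parallelize(Seq(("a", "e")))        // (a, e)

// Step 1: key rdd1 by `a` so it can be joined with rdd3.
val byA = rdd1.map { case ((a, b), c) => (a, (b, c)) }
// Step 2: join on `a`, then re-key by `b` for the second join.
val byB = byA.join(rdd3).map { case (a, ((b, c), e)) => (b, (a, c, e)) }
// Step 3: join on `b` and restore the ((a, b), c, d, e) shape.
val res = byB.join(rdd2).map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) }

val out = res.collect()
out.foreach(println)
```

Both joins are inner joins, so a row of rdd1 only survives if its `a` appears in rdd3 and its `b` appears in rdd2.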
With the current implementations of the join APIs for pair RDDs, it's not possible to use conditions, and you would need conditions when joining to get the desired result.
But you can use dataframes/datasets for the joins, where conditions are supported. If you want the result of the join as a dataframe, you can proceed with that. In case you want your results as RDDs, .rdd can be used to convert the dataframes/datasets to RDD[Row].
Below is sample code showing how it can be done in Scala:
//creating three rdds
val first = sc.parallelize(Seq((("a", "b"), "c")))
val second = sc.parallelize(Seq(("b", "d")))
val third = sc.parallelize(Seq(("a", "e")))
//converting rdds to dataframes
val firstdf = first.toDF("key1", "value1")
val seconddf = second.toDF("key2", "value2")
val thirddf = third.toDF("key3", "value3")
//udf function for the join condition
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
def joinCondition = udf((strct: Row, key: String) => strct.toSeq.contains(key))
//joins with conditions
firstdf
.join(seconddf, joinCondition(firstdf("key1"), seconddf("key2"))) //joining first with second
.join(thirddf, joinCondition(firstdf("key1"), thirddf("key3"))) //joining first with third
.drop("key2", "key3") //dropping unnecessary columns
.rdd //converting dataframe to rdd
You should have output as
[[a,b],c,d,e]
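As a possible alternative to the UDF condition, the fields of the struct column can be addressed directly, which turns both joins into plain equality joins. This is a sketch under assumptions: it relies on the tuple key being stored as a struct with fields `_1` and `_2` (the default for a Scala tuple), it uses a local SparkSession created for the demonstration, and the names `joinedDf` and `rows` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration purposes only.
val spark = SparkSession.builder().master("local[*]").appName("struct-field-join").getOrCreate()
import spark.implicits._

val firstdf = Seq((("a", "b"), "c")).toDF("key1", "value1") // key1 is struct<_1, _2>
val seconddf = Seq(("b", "d")).toDF("key2", "value2")
val thirddf = Seq(("a", "e")).toDF("key3", "value3")

val joinedDf = firstdf
  .join(seconddf, $"key1._2" === $"key2") // second coordinate matches key2
  .join(thirddf, $"key1._1" === $"key3")  // first coordinate matches key3
  .drop("key2", "key3")

val rows = joinedDf.collect()
joinedDf.show()
```

The design reason for preferring this form: with an equality condition Spark can typically pick a hash or sort-merge join, whereas an opaque UDF condition generally forces it to compare every pair of rows.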

How to iterate over groups from cogroup to print key and its values (per group)?

I am learning spark and have the following code:
val rdd1 = sc.parallelize(Seq(("key1", 1), ("key1", 3), ("key2", 2), ("key3", 1)))
val rdd2 = sc.parallelize(Seq(("key1", 5),("key2", 4),("key4", 1)))
val grouped = rdd1.cogroup(rdd2)
grouped.collect()
Output:
Array[(String, (Iterable[Int], Iterable[Int]))] = Array(
(key3,(CompactBuffer(1),CompactBuffer())),
(key1,(CompactBuffer(1, 3),CompactBuffer(5))),
(key4,(CompactBuffer(),CompactBuffer(1))),
(key2,(CompactBuffer(2),CompactBuffer(4))))
How to iterate the values in a way that I get the output as follows:
key1,1,3,5
key2,2,4
key4,1
key3,1
below is the code i have tried.
val z=grouped.map{x=>
val key=x._1
val value=x._2
val source1=value._1
val final_value=source1.map{if(value._1>=1) value._1}
(key,final_value)
}
I recommend that you replace cogroup with join, which gives you a sequence of pairs of a key and its values. Note that an inner join keeps only the keys present in both RDDs, so key3 and key4 do not appear in the output:
val rdd1 = sc.parallelize(Seq(("key1", 1), ("key1", 3), ("key2", 2), ("key3", 1)))
val rdd2 = sc.parallelize(Seq(("key1", 5),("key2", 4),("key4", 1)))
val joined = rdd1.join(rdd2)
scala> joined.foreach(println)
(key2,(2,4))
(key1,(1,5))
(key1,(3,5))
// or using Spark SQL's Dataset API
scala> joined.toDF("key", "values").show
+----+------+
| key|values|
+----+------+
|key1| [1,5]|
|key1| [3,5]|
|key2| [2,4]|
+----+------+
If however you want to stay with cogroup to learn Spark's RDD API, you'd print grouped.collect as follows:
// I assume grouped is the value after cogroup+collect
// just because it's easier to demo the solution
val grouped = rdd1.cogroup(rdd2).collect
scala> grouped.foreach(println)
(key1,(CompactBuffer(1, 3),CompactBuffer(5)))
(key2,(CompactBuffer(2),CompactBuffer(4)))
(key3,(CompactBuffer(1),CompactBuffer()))
(key4,(CompactBuffer(),CompactBuffer(1)))
// the solution
grouped.
map { case (k, (g1, g2)) => (k, g1 ++ g2) }.
map { case (k, vs) => s"$k,${vs.mkString(",")}" }.
foreach(println)
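Another way to reach the requested output, sketched here as an alternative to cogroup, is to union the two RDDs and group by key. The local SparkSession and the name `lines` are assumptions for illustration; the values are sorted inside each group because the order of values after a shuffle is not guaranteed.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration purposes only.
val spark = SparkSession.builder().master("local[*]").appName("union-groupByKey").getOrCreate()
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq(("key1", 1), ("key1", 3), ("key2", 2), ("key3", 1)))
val rdd2 = sc.parallelize(Seq(("key1", 5), ("key2", 4), ("key4", 1)))

val lines = rdd1.union(rdd2)
  .groupByKey()                                                    // (key, all values from both RDDs)
  .map { case (k, vs) => s"$k,${vs.toSeq.sorted.mkString(",")}" }  // sort values for a stable line
  .collect()
  .sorted                                                          // sort lines by key for stable output

lines.foreach(println)
```

Unlike the join variant, this keeps keys that appear in only one of the two RDDs, but it no longer records which RDD a value came from.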
I think the easiest way is to convert to a DataFrame, group by key, and collect the values as a list.
val rdd2 = spark.sparkContext.parallelize(Seq(("key1", 3),("key1", 5),("key2", 4),("key4", 1))).toDF("K", "V")
rdd2.groupBy("K").agg(collect_list($"V")).show
Hope this helps

compute new RDD from 2 original RDD

I have 2 RDD in Key-Value type. RDD1 is [K,V], RDD2 is [K,U].
The set of K of both RDD1 and RDD2 are the same.
I need to map to a new RDD with [K, (U-V)/(U+V)].
My approach is to first join RDD1 to RDD2:
val newRDD = RDD1.join(RDD2)
Then map the new RDD (after the join, line._2._1 is V and line._2._2 is U):
newRDD.map(line=> (line._1, (line._2._2-line._2._1)/(line._2._2+line._2._1)))
The problem is that RDD1 (and RDD2) has over 100 million records, so the join between the 2 sets is very expensive and takes a long time (3 mins) to execute.
Are there any better ways to reduce the time of this task?
Try converting them to DataFrame first:
val df1 = RDD1.toDF("v_key", "v")
val df2 = RDD2.toDF("u_key", "u")
val newDf = df1.join(df2, $"v_key" === $"u_key")
newDf.select($"v_key", ($"u" - $"v") / ($"u" + $"v")).rdd
Aside from being a lot faster (because Spark will do the optimizing for you) I think it reads better.
I should also note that if it were me, I wouldn't do the .rdd at the end -- I would leave it a DataFrame. But that's me.
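For reference, here is a small self-contained sketch of the DataFrame approach with made-up data. The assumptions: a local SparkSession created for the demonstration, Double values, two sample keys, and a shared column name `key` so that the join can use the column name directly; the names `ratios` and `ratio` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for demonstration purposes only.
val spark = SparkSession.builder().master("local[*]").appName("ratio-join").getOrCreate()
import spark.implicits._
val sc = spark.sparkContext

val rdd1 = sc.parallelize(Seq(("k1", 1.0), ("k2", 1.0))) // (K, V)
val rdd2 = sc.parallelize(Seq(("k1", 3.0), ("k2", 4.0))) // (K, U)

val df1 = rdd1.toDF("key", "v")
val df2 = rdd2.toDF("key", "u")

// Joining on the shared column name keeps a single `key` column in the result.
val ratios = df1.join(df2, "key")
  .select($"key", (($"u" - $"v") / ($"u" + $"v")).as("ratio"))

ratios.show()
```

One difference worth knowing: the `/` operator on DataFrame columns performs fractional division even for integer columns, whereas dividing two Scala Int values in an RDD map truncates the result.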
